Faster IPv4 WHOIS Crawling

The code base discussed in this blog can be found on GitHub.

A few days ago I put together an IPv4 WHOIS crawler using Django, Redis and Kafka and launched it on what should have been a 51-node cluster on AWS EC2. While the cluster ran I was able to identify several shortcomings of the code and style of execution. Nonetheless, I could also see that there were good performance characteristics that could be further improved upon. These findings are discussed throughout my blog post “Mass IP Address WHOIS Collection with Django & Kafka”.

Yesterday I sat down for a few hours and made some architectural changes to the code base and took it for another spin.

Architectural Improvements

First, each worker now checks on its own machine whether the IP address it’s looking up falls within a CIDR block that has already been crawled. This means the master node isn’t performing CPU-intensive lookups on behalf of 50 worker nodes. To do this, each worker speaks to a local Redis instance that is a slave of the Redis instance on the coordinator. When a WHOIS query is successfully completed, the worker node passes the result to Kafka. The coordinator pulls out every unique CIDR block it sees in Kafka and stores them as a single string value in Redis. Redis then replicates that key across all the slave nodes. That key is then used in the CIDR hit calculations on each worker.
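As a rough illustration, the per-worker CIDR hit check could look something like the sketch below. It isn't the code from the repository: the post only says the blocks are stored as a single string, so the comma delimiter and the function names here are assumptions, and the sample value stands in for what a worker would read from its local Redis replica.

```python
import ipaddress


def parse_cidrs(raw):
    """Turn a delimited string of CIDR blocks into network objects."""
    # Assumption: blocks are comma-separated within the single string value.
    return [ipaddress.ip_network(block.strip())
            for block in raw.split(',')
            if block.strip()]


def within_known_cidr(ip, networks):
    """Return True if the address falls inside any already-crawled block."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


# In the cluster this string would come from the worker's local Redis replica.
raw_value = '8.8.8.0/24,193.0.0.0/21'
known = parse_cidrs(raw_value)

print(within_known_cidr('8.8.8.8', known))  # True
print(within_known_cidr('1.1.1.1', known))  # False
```

Every check walks the whole list, which is why the cost of this step grows as more CIDR blocks are discovered.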

Second, all the Django-based processes on the worker nodes now run via Supervisor. If a process exits with an exception, a reasonable number of attempts are made to restart the process. If the exception is a one-off or a rare occurrence then the worker node can continue to be productive rather than just sit idle.

Third, all worker nodes pull their configuration settings from Redis. I can set the configuration keys via a management command on the coordinator and Redis will replicate them to each worker’s Redis instance. Workers are designed to wait and try again if they can’t get the Redis key with the HTTP endpoint for the coordinator and/or the Kafka host. When they can see both of those values they will then begin working. This makes deployment of the workers a lot easier as I don’t need to know any configuration settings for them in advance.
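The wait-and-retry behaviour can be sketched as a small polling loop. This is only an illustration under assumptions: the key names, the Kafka port and the FakeReplica class are hypothetical, and get_value would in practice be something like a Redis client’s get method pointed at the worker’s local instance.

```python
import time


def wait_for_config(get_value, keys, delay=5.0, sleep=time.sleep):
    """Poll until every key has a value, then return them as a dict.

    get_value is any callable that takes a key and returns its value, or
    None when the key has not been set yet.
    """
    while True:
        config = {key: get_value(key) for key in keys}
        if all(value is not None for value in config.values()):
            return config
        sleep(delay)  # Replication has not caught up yet; try again shortly.


# Stand-in for a local Redis replica that receives its keys after a delay.
# The key names and values below are hypothetical.
class FakeReplica:
    def __init__(self):
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.calls <= 4:  # The first two polling rounds find nothing.
            return None
        return {'coordinator_endpoint': 'http://172.30.0.239:8000',
                'kafka_host': '172.30.0.239:9092'}[key]


replica = FakeReplica()
config = wait_for_config(replica.get,
                         ['coordinator_endpoint', 'kafka_host'],
                         sleep=lambda seconds: None)  # Skip real waiting here.
print(config)
```

Passing the getter and the sleep function in as arguments keeps the loop testable without a running Redis instance.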

Fourth, I’ve generated the seed list of 4.7 million IPv4 addresses in advance in an sqlite3 file and I push it to the coordinator after the coordinator has been deployed. This saves me time getting the coordinator up and running and gets the workers working sooner.

Fifth, I’ve created a management command to display aggregated telemetry so I can see overall progress when the cluster is running.

1 Coordinator, 1 Worker, Up & Running

To start I’ll add a rule to the ip-whois-sg security group to allow all EC2 instances within the group to speak to one another on Redis’ port 6379.

Then I launched two on-demand instances using the ami-f95ef58a Ubuntu 14.04 LTS image. The first instance is a t2.small for the coordinator. It has the public and private IP addresses of 54.171.53.151 and 172.30.0.239 respectively.

The second instance I launched was an on-demand t2.medium instance with the public IP address of 54.171.49.114. This instance will be set up as a worker, have an AMI image of it baked and then the instance will be terminated. The AMI image will then be used to launch 50 spot instances.

The last time I set up this cluster I used Ansible to provision each worker and a number of them didn’t provision properly even though multiple attempts were made. Not only is using an AMI more reliable, it’s much faster than Ansible and its 1,000s of SSH connections.

With those two instances launched I created a devops/inventory file.

[coordinator]
coord1 ansible_host=54.171.53.151 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem

[worker]
worker1 ansible_host=54.171.49.114 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem

I then ran two SSH commands to add their ECDSA key fingerprints to my list of known hosts.

$ ssh -i ~/.ssh/ip_whois.pem \
      -o StrictHostKeyChecking=no \
      ubuntu@54.171.53.151 \
      "test"

$ ssh -i ~/.ssh/ip_whois.pem \
      -o StrictHostKeyChecking=no \
      ubuntu@54.171.49.114 \
      "test"

I then zipped up the code base so that Ansible will be able to deploy it to the instances.

$ zip -r \
    app.zip \
    ips/ *.txt \
    -x *.sqlite3 \
    -x *.pid \
    -x *.pyc

With the zip file in place I was then able to run the Ansible-based bootstrap script.

$ cd devops
$ ansible-playbook bootstrap.yml

When that completed I checked that the Supervisor-managed processes were running on the worker. It’s important that these three processes start properly when a worker instance is booted up. When Supervisor is installed via apt it’ll install scripts to start itself when the machine is launched. Then, if the virtual environment wrapper script works properly and the code base is in place, each of the three processes should launch correctly and consistently.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.49.114 \
      'sudo supervisorctl status'
celerybeat                       RUNNING    pid 26436, uptime 0:01:08
celeryd                          RUNNING    pid 26435, uptime 0:01:08
get_ips_from_coordinator         RUNNING    pid 26437, uptime 0:01:08

The three processes above are configured for Supervisor in the devops/config/worker.supervisor.conf file. Those processes are:

  • get_ips_from_coordinator will take batches of 1,000 IPv4 addresses from the coordinator and check whether each IP address falls within an already-known CIDR block. If it doesn’t, it finds the registry for that address and queues up the WHOIS lookup.
  • celeryd runs the celery queues that look up WHOIS details on each of the five registries.
  • celerybeat will feed telemetry back to Kafka that will be picked up by the coordinator node.

Below is the supervisor configuration file.

[program:celeryd]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py celeryd --concurrency=30
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/celeryd.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu

[program:celerybeat]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py celerybeat
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/celerybeat.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu

[program:get_ips_from_coordinator]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py get_ips_from_coordinator
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/get_ips_from_coordinator.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu

With the worker behaving as expected I baked an AMI image called ‘worker’ and terminated the on-demand instance.

Pre-generated IPv4 Seed List

To avoid spending 25 minutes running a CPU-intensive IPv4 generation task on the coordinator, I ran the gen_ips management command on my own, more powerful local machine.

$ python manage.py gen_ips

I then compressed the 109 MB db.sqlite3 database file my local instance of Django was using, uploaded it to the coordinator and decompressed it in place ready to go.

$ gzip db.sqlite3
$ scp -i ~/.ssh/ip_whois.pem \
      db.sqlite3.gz \
      ubuntu@54.171.53.151:/home/ubuntu/ips/
$ cd devops
$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 gunzip -f db.sqlite3.gz"'

I then checked that the file was where I expected it to be and back at its original size of 109 MB.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.53.151 \
      'ls -lh ips/db.sqlite3'
-rw-r--r-- 1 ubuntu ubuntu 109M Apr 29 18:55 ips/db.sqlite3

Launching 50 EC2 Spot Instances

With the coordinator already up I now needed to launch a cluster of 50 worker spot instances. The smallest type of spot instance I can launch is the m4.large. I bid a maximum of $0.02 / hour for each of the instances bringing my total cluster cost to a maximum of $1.028 / hour.

When I requested the spot instances I asked that they use the ‘worker’ AMI image I had baked. That way each of the spot instances would launch with all their software already in place and Supervisor can launch the three processes they need to run automatically on boot.

Within two minutes all of my spot instances had been provisioned and were running. I then collected the public IP addresses of each of the worker instances and added their ECDSA key fingerprints to my list of known hosts.

$ WORKER_IPS=$(aws ec2 describe-instances \
                  --query 'Reservations[].Instances[].[PublicIpAddress]' \
                  --output text |
                  sort |
                  uniq |
                  grep -v None |
                  grep -v '54.171.53.151')

$ for IP in $WORKER_IPS; do
      ssh -i ~/.ssh/ip_whois.pem \
          -o StrictHostKeyChecking=no \
          ubuntu@$IP \
          "test" &
  done

I then re-wrote my devops/inventory file, replacing the original worker entry with the 50 new workers. I wish I’d written a fancy script for this task but instead I used some search/replace and column editing in my text editor to complete this work.

Here is the resulting devops/inventory file:

[coordinator]
coord1 ansible_host=54.171.53.151 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem

[worker]
worker1  ansible_host=54.171.109.146 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker2  ansible_host=54.171.109.215 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker3  ansible_host=54.171.109.55  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker4  ansible_host=54.171.114.203 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker5  ansible_host=54.171.115.48  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker6  ansible_host=54.171.118.226 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker7  ansible_host=54.171.119.195 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker8  ansible_host=54.171.120.62  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker9  ansible_host=54.171.129.29  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker0  ansible_host=54.171.139.137 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workera  ansible_host=54.171.142.194 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerb  ansible_host=54.171.152.199 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerc  ansible_host=54.171.158.140 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerd  ansible_host=54.171.159.0   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workere  ansible_host=54.171.174.252 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerf  ansible_host=54.171.175.16  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerg  ansible_host=54.171.175.180 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerh  ansible_host=54.171.175.225 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeri  ansible_host=54.171.176.62  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerj  ansible_host=54.171.177.14  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerk  ansible_host=54.171.177.213 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerl  ansible_host=54.171.208.177 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerm  ansible_host=54.171.209.128 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workern  ansible_host=54.171.210.135 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workero  ansible_host=54.171.210.4   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerp  ansible_host=54.171.212.94  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerq  ansible_host=54.171.222.148 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerr  ansible_host=54.171.222.249 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workers  ansible_host=54.171.224.201 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workert  ansible_host=54.171.226.27  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeru  ansible_host=54.171.51.109  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerv  ansible_host=54.171.51.188  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerw  ansible_host=54.171.52.148  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerx  ansible_host=54.171.52.212  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workery  ansible_host=54.171.54.52   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerz  ansible_host=54.171.55.140  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker11 ansible_host=54.171.56.152  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker12 ansible_host=54.171.57.251  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker13 ansible_host=54.171.69.208  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker14 ansible_host=54.171.69.4    ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker15 ansible_host=54.171.70.196  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker16 ansible_host=54.171.71.153  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker17 ansible_host=54.171.71.156  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker18 ansible_host=54.171.74.186  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker19 ansible_host=54.171.74.205  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker20 ansible_host=54.171.74.34   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker21 ansible_host=54.171.74.92   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker23 ansible_host=54.171.81.69   ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker24 ansible_host=54.171.82.207  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker25 ansible_host=54.171.83.114  ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem

Please ignore the naming strategy, it’s not something I thought through.
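For what it’s worth, a small script could have produced the same file with sequential names. The sketch below is hypothetical rather than something from the repository: build_inventory is an invented helper, and it assumes the worker IP addresses have already been collected (for example from the WORKER_IPS variable above).

```python
def build_inventory(coordinator_ip, worker_ips,
                    user='ubuntu', key_file='~/.ssh/ip_whois.pem'):
    """Render an Ansible INI inventory with sequentially numbered workers."""
    suffix = ' ansible_user=%s ansible_private_key_file=%s' % (user, key_file)
    lines = ['[coordinator]',
             'coord1 ansible_host=%s%s' % (coordinator_ip, suffix),
             '',
             '[worker]']
    # De-duplicate and sort so repeated runs give the same host names.
    for number, ip in enumerate(sorted(set(worker_ips)), start=1):
        lines.append('worker%d ansible_host=%s%s' % (number, ip, suffix))
    return '\n'.join(lines) + '\n'


inventory = build_inventory('54.171.53.151',
                            ['54.171.109.215', '54.171.109.146'])
print(inventory)
```

Writing the returned string to devops/inventory would replace the manual search/replace step.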

Coordinator Services Up & Running

I’ve created a Django command that sets the configuration the cluster needs from nothing more than the coordinator’s private IP address. This value is then used to set multiple Redis key/value pairs.

$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 python manage.py set_config 172.30.0.239"'

I then launched the reference WSGI web server and the collect_whois process, which monitors Kafka and creates the Redis key of all the unique CIDR blocks that have been seen across all the successful WHOIS queries.

$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 nohup python manage.py runserver 0.0.0.0:8000 &"'
$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 nohup python manage.py collect_whois &"'

With those in place I’ll tell each Redis instance across the cluster of worker nodes what the private IP address of the Redis master is. The worker nodes are already up and running but won’t begin to work till they can collect their configuration from their local Redis instance. The master Redis instance already has these configuration keys in place and once the slaves have this information replicated to them they will get started.

$ ansible worker \
    -m shell \
    -a "echo 'slaveof 172.30.0.239 6379' | redis-cli"

Cluster Telemetry

I have two primary commands that report back on the progress the cluster is making. The first shows the per-minute, per-node telemetry which can be found simply by following the ‘metrics’ Kafka topic.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.53.151 \
      "/tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
      --zookeeper localhost:2181 \
      --topic metrics \
      --from-beginning"

Here is an example output line (formatted and key-sorted for clarity).

{
    "Host": "172.30.0.12",
    "Timestamp": "2016-04-29T19:09:59.575451",

    "Within Known CIDR Block": 93,
    "Awaiting Registry": 1,
    "Found Registry": 135,
    "Looking up WHOIS": 10,
    "Got WHOIS": 191,
    "Failed to lookup WHOIS": 10
}

The second command collects the latest telemetry from each individual host seen in the ‘metrics’ topic and sums the values for each metric. This lets me see a running total of the cluster’s overall performance.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.53.151 \
      "cd /home/ubuntu/ips &&
       source /home/ubuntu/.ips/bin/activate &&
       python manage.py telemetry"

Here is an example output line (formatted and key-sorted for clarity).

{
    "Within Known CIDR Block": 1953,
    "Awaiting Registry": 47,
    "Found Registry": 2303,
    "Looking up WHOIS": 378,
    "Got WHOIS": 4080,
    "Failed to lookup WHOIS": 128
}
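The aggregation itself is simple enough to sketch. The following is not the telemetry command from the repository, just a minimal illustration of keeping the most recent report per host and summing each metric. It assumes each per-node report carries cumulative counts, and the sample messages are made up in the shape of the ‘metrics’ topic output.

```python
import json
from collections import Counter


def aggregate(lines):
    """Sum the most recent report from each host across all metrics."""
    latest = {}
    for line in lines:
        report = json.loads(line)
        host = report['Host']
        # ISO 8601 timestamps in one format sort correctly as strings.
        if host not in latest or report['Timestamp'] > latest[host]['Timestamp']:
            latest[host] = report

    totals = Counter()
    for report in latest.values():
        for metric, value in report.items():
            if metric not in ('Host', 'Timestamp'):
                totals[metric] += value
    return dict(totals)


# Made-up sample messages; only the newest report per host is counted.
sample = [
    '{"Host": "172.30.0.12", "Timestamp": "2016-04-29T19:08:59", "Got WHOIS": 150}',
    '{"Host": "172.30.0.12", "Timestamp": "2016-04-29T19:09:59", "Got WHOIS": 191}',
    '{"Host": "172.30.0.13", "Timestamp": "2016-04-29T19:09:58", "Got WHOIS": 9}',
]
print(aggregate(sample))  # {'Got WHOIS': 200}
```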

In the previous deployment of this cluster the coordinator was under heavy load from performing CPU-intensive CIDR hit calculations on behalf of all the worker nodes. I’ve since moved that task onto the worker nodes themselves. 45 minutes after the cluster was launched I ran top on the coordinator and one of the workers to see how much pressure they were under.

The following is from the coordinator. As you can see it’s pretty quiet.

top - 19:46:17 up  1:14,  1 user,  load average: 0.04, 0.18, 0.25
Tasks: 108 total,   2 running, 105 sleeping,   0 stopped,   1 zombie
%Cpu0  :  4.5 us,  2.7 sy,  0.0 ni, 88.7 id,  0.7 wa,  0.0 hi,  2.7 si,  0.7 st
KiB Mem:   2048516 total,  1915612 used,   132904 free,   131260 buffers
KiB Swap:        0 total,        0 used,        0 free.   940036 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
29070 redis     20   0  538276 290828    984 S  2.2 14.2   0:59.54 /usr/bin/redis-server 0.0.0.0:6379
 4960 ubuntu    20   0 1902972 257788  12408 S  2.9 12.6   1:46.59 java -Xmx1G -Xms1G -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava+
30391 ubuntu    20   0  654672  46840   4700 S 19.7  2.3   3:35.93 /home/ubuntu/.ips/bin/python manage.py runserver 0.0.0.0:8000
29179 zookeep+  20   0 1244636  42356  11180 S  0.0  2.1   0:03.56 /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.j+
11192 rabbitmq  20   0  594416  40704   2464 S  0.0  2.0   0:06.21 /usr/lib/erlang/erts-5.10.4/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq+
30415 ubuntu    20   0  393460  37320   4264 S  3.2  1.8   1:48.28 python manage.py collect_whois
30386 ubuntu    20   0   84720  26176   4020 S  0.0  1.3   0:00.20 python manage.py runserver 0.0.0.0:8000
...

Here is one of the workers. It’s using a fair amount of CPU but not an excessive amount. The networking and CPU loads are now better balanced.

top - 19:47:08 up 45 min,  1 user,  load average: 1.37, 1.04, 0.87
Tasks: 146 total,   3 running, 143 sleeping,   0 stopped,   0 zombie
%Cpu0  : 30.7 us,  0.7 sy,  0.0 ni, 67.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 59.2 us,  0.0 sy,  0.0 ni, 40.5 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8175632 total,  1550272 used,  6625360 free,   136020 buffers
KiB Swap:        0 total,        0 used,        0 free.   242072 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1208 rabbitmq  20   0 1196892 111400   2564 S   0.7  1.4   0:21.47 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rab+
 1460 ubuntu    20   0  120696  45008   4944 R  88.7  0.6  28:55.69 python manage.py get_ips_from_coordinator
 1593 ubuntu    20   0  430460  44828   4160 S   0.0  0.5   0:01.47 python manage.py celeryd --concurrency=30
...

The Redis key that stores the list of CIDR blocks will keep growing as the cluster works its way through the workload. After an hour and 20 minutes of running, the key’s value had reached 700,771 bytes in length.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.53.151 \
      'echo "GET cidrs" |
       redis-cli |
       wc -c'

Every time the coordinator updates the ‘cidrs’ key Redis replicates it to all 50 slaves. The more CIDR blocks in that list, the longer it takes each worker node to find out whether the IP address it’s about to look up is in that list. I expected CPU and network usage to eventually grow out of hand, but the AWS CloudWatch charts showed the CPU on the worker nodes plateauing around 50% on average and the network usage, despite sending ~700 KB of data to 50 machines after every successful WHOIS lookup, staying relatively low.

Getting Through The Workload

After an hour and 15 minutes the telemetry management command was reporting a great deal of progress.

{
    "Within Known CIDR Block": 133181,
    "Awaiting Registry": 49,
    "Found Registry": 5447,
    "Looking up WHOIS": 946,
    "Got WHOIS": 103091,
    "Failed to lookup WHOIS": 14721
}

257,435 of the 4.7 million seed IPv4 addresses either had been or were being processed and 51% of those records required no external action. These performance metrics gave me a lot of confidence that if I were to spin up 200 spot instances to act as worker nodes, they could reliably perform their tasks and get through the lion’s share of the work in 3 to 4 hours.
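Those two figures come straight from the telemetry output above: the total is the sum of all six counters and the share needing no external action is the ‘Within Known CIDR Block’ counter divided by that total. A quick check of the arithmetic:

```python
telemetry = {
    'Within Known CIDR Block': 133181,
    'Awaiting Registry': 49,
    'Found Registry': 5447,
    'Looking up WHOIS': 946,
    'Got WHOIS': 103091,
    'Failed to lookup WHOIS': 14721,
}

# Every address a worker has touched lands in exactly one of these counters.
touched = sum(telemetry.values())
no_lookup_needed = telemetry['Within Known CIDR Block'] / float(touched)

print(touched)                              # 257435
print('%.1f%%' % (no_lookup_needed * 100))  # 51.7%
```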

I did spot check the error log on one of the workers. AFRINIC and LACNIC were no longer responding to WHOIS requests but the other three registries were responding well.

$ tail -n12 ~/celeryd.log
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/154.126.120.129.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/196.36.216.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/200.0.36.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.16.144.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.229.204.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.240.12.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/177.28.132.1.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/187.248.216.1.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/186.230.24.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/186.61.132.65.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/187.57.48.129.
... ASN registry lookup failed.

Two Hour Cut-Off

My intention with this cluster was to see if I could improve the reliability of the code and build on the earlier performance gains. I decided to shut down the cluster before the two hour mark and collect the IPv4 WHOIS results. At an hour and 45 minutes, just before I shut down the cluster, this was the output from the telemetry command:

{
    "Within Known CIDR Block": 176524,
    "Awaiting Registry": 49,
    "Found Registry": 65,
    "Looking up WHOIS": 527,
    "Got WHOIS": 128541,
    "Failed to lookup WHOIS": 18254
}

Collecting the Results

I ran the following to collect the WHOIS results off the coordinator.

$ ssh -i ~/.ssh/ip_whois.pem \
      ubuntu@54.171.53.151

$ /tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic results \
    --from-beginning > results &

# Wait here till you see the results file stop growing.

$ gzip results

The results file is 240 MB when uncompressed and contains 129,183 lines of WHOIS results in line-delimited, JSON format.

How much IPv4 space was covered?

I ran two calculations on the data. The first was to find how many distinct CIDR blocks were successfully found. The answer is 50,751.

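import json


# Each non-empty line of the results file is one JSON-encoded WHOIS result.
results = [json.loads(line)
           for line in open('results').read().split('\n')
           if line.strip()]

# Collect the distinct CIDR blocks, skipping results without a usable one.
cidr = set([res['Whois']['asn_cidr']
            for res in results
            if 'Whois' in res and
               'asn_cidr' in res['Whois'] and
               res['Whois']['asn_cidr'] != 'NA'])

print(len(cidr))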

I then wanted to see how many distinct IPv4 addresses this represented and what proportion of the non-reserved IPv4 address space is covered.

from netaddr import IPNetwork


# Add up the number of addresses in every distinct CIDR block.
print(sum([IPNetwork(c).size
           for c in cidr if c]))

The cluster managed to find WHOIS details on CIDR blocks representing 2,390,992,225 distinct IPv4 addresses covering over 64% of the entire non-reserved IPv4 address space.
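One caveat with summing block sizes is that any overlapping or nested blocks would be counted more than once. The sketch below is not part of the original analysis; it shows how overlaps could be collapsed first using Python’s standard ipaddress module, and how the coverage percentage can be derived. The size used for the non-reserved IPv4 space is an assumption, as the figure behind the 64% isn’t stated above.

```python
import ipaddress


def distinct_addresses(cidrs):
    """Count addresses covered by the blocks, counting overlaps only once."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return sum(net.num_addresses
               for net in ipaddress.collapse_addresses(networks))


# 10.0.0.0/24 sits inside 10.0.0.0/16, so only the /16 and the /30 count.
sample = ['10.0.0.0/16', '10.0.0.0/24', '192.0.2.0/30']
print(distinct_addresses(sample))  # 65540

# Assumed size of the non-reserved IPv4 space: 2**32 minus 592,708,864
# reserved addresses. This figure is an assumption, not from the post.
NON_RESERVED = 2 ** 32 - 592708864
coverage = 2390992225 * 100.0 / NON_RESERVED
print('%.1f%%' % coverage)
```

With that assumed denominator the reported address count works out to a little over 64%, in line with the figure above.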

Source: Mark Litwintschik
