1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

On November 17th, 2015, Todd Schneider published a blog post titled Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance, in which he analysed the metadata of 1.1 billion taxi journeys made in New York City between 2009 and 2015. Included with this work was a link to a GitHub repository where he published the SQL, Shell and R files he used, along with instructions on how to get everything up and running. The R files also create a few additional charts that were used in follow-up posts.

In this blog post I’ll launch 4 different types of AWS RDS instances running PostgreSQL 9.5.2 and benchmark creating the same graphs that Todd Schneider did in his analysis.

PostgreSQL on AWS RDS Up & Running

I’ll launch 4 RDS instances. They will all be running PostgreSQL 9.5.2 in eu-west-1a. Each will have 400 GB of General Purpose SSD capacity and will not be replicated across additional availability zones. Their maintenance windows have been set so that maintenance will not run during this benchmark. They will each have their own VPC.

The instance types are as follows:

  • DB1 is a db.t2.large, which costs $5.50 / day including storage costs for RDS alone.
  • DB2 is a db.r3.large, which costs $8.52 / day including storage costs for RDS alone.
  • DB3 is a db.m4.large, which costs $6.60 / day including storage costs for RDS alone.
  • DB4 is a db.m4.xlarge, which costs $11.53 / day including storage costs for RDS alone.

The instances each took 5 minutes to launch.
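For reference, an instance of this shape can be launched from the AWS CLI as well as the console. The identifier, master username and password below are placeholders rather than the values I used, so treat this as a sketch:

```shell
# Sketch: launch one of the RDS instances described above.
# Identifier and credentials are placeholders.
aws rds create-db-instance \
    --db-instance-identifier trips-db3 \
    --db-instance-class db.m4.large \
    --engine postgres \
    --engine-version 9.5.2 \
    --allocated-storage 400 \
    --storage-type gp2 \
    --availability-zone eu-west-1a \
    --no-multi-az \
    --master-username master \
    --master-user-password changeme
```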

Each RDS instance is paired with a t2.large EC2 instance running Ubuntu 14.04.3 LTS that also sits in eu-west-1a. These instances will run the data import and R-based analysis scripts. Each of these instances has 500 GB of General Purpose SSD (GP2) capacity and 1,500 IOPS at its disposal. The instances are rated for low to moderate network performance. Each of these instances costs $4.57 / day including storage.

Downloading 1.2 Billion Taxi Trips

In this benchmark I’ll be downloading all the data fresh and importing it into each RDS instance individually from each paired EC2 instance. The original dataset of 1.1 billion taxi journeys has grown with the release of the last half of 2015’s data and now holds a little over 1.2 billion journeys.

The following are the commands I ran to bootstrap each of the EC2 instances and download the 300+ GB of data.

$ sudo apt-get update
$ sudo apt-get install \
      build-essential \
      git \
      postgis \
      postgresql-client \
      ruby \
      ruby-dev \
      unzip \
      zlib1g-dev

$ sudo gem install \
      activesupport \
      roo \
      rest-client
$ git clone https://github.com/toddwschneider/nyc-taxi-data.git
$ cd nyc-taxi-data
$ cat raw_uber_data_urls.txt \
      raw_data_urls.txt |
      xargs -n 1 -P 6 \
          wget -P data/
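The `-n 1 -P 6` flags hand wget one URL at a time across six parallel processes. The fan-out mechanics can be seen with a toy pipeline:

```shell
# Feed three items, one at a time (-n 1), to up to two
# parallel echo processes (-P 2). Completion order isn't
# guaranteed, so sort the output before inspecting it.
printf 'a\nb\nc\n' | xargs -n 1 -P 2 echo | sort
```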

Importing 1.2 Billion Taxi Trips

In PostgreSQL, nyc-taxi-data is perfectly fine as a database name, but on RDS database names can only contain letters, numbers and underscores. For this reason I named the databases trips. In order for the shell and Ruby scripts to work with this new database name I patched them with the following:

$ sed -i 's/psql nyc-taxi-data/psql trips/' \
      {*.sh,tlc_statistics/import_statistics_data.rb}
$ sed -i 's/psql -d nyc-taxi-data/psql -d trips/' *.sh
$ sed -i 's/createdb/#createdb/' *.sh
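The substitution patterns can be sanity-checked against a sample line before rewriting the repository's scripts in place with `-i`:

```shell
# Dry-run the first substitution on a sample line instead of
# editing files in place.
echo 'cat file.sql | psql nyc-taxi-data' |
    sed 's/psql nyc-taxi-data/psql trips/'
# → cat file.sql | psql trips
```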

I then created environment variables on each EC2 instance for their respective RDS pairings.

$ read PGHOST
$ read PGUSER
$ read PGPASSWORD

$ export PGHOST
$ export PGUSER
$ export PGPASSWORD
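Using `read` keeps the credentials out of the shell's history. An alternative would be a `~/.pgpass` file, which psql picks up automatically; the hostname and credentials below are placeholders:

```shell
# ~/.pgpass format: hostname:port:database:username:password
# Host and credentials below are placeholders.
echo 'trips-db1.example.eu-west-1.rds.amazonaws.com:5432:trips:master:changeme' > ~/.pgpass
chmod 600 ~/.pgpass  # psql ignores the file if it's group/world-readable
```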

I then ran the database initialisation, population and statistics generation scripts.

$ time (
      ./initialize_database.sh;
      ./import_trip_data.sh;
      ./import_uber_trip_data.sh;
      cat analysis/prepare_analysis.sql \
            tlc_statistics/create_statistics_tables.sql |
            psql trips;
      cd tlc_statistics;
      ruby import_statistics_data.rb
  )
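Once the pipeline finishes, a quick sanity check is to count the imported rows; the repository loads the journeys into a table named trips, so the count should come back a little over 1.2 billion. This needs the live RDS connection via the PG* environment variables set earlier, so it's shown for reference only:

```shell
# Sanity check after the import completes. Relies on PGHOST,
# PGUSER and PGPASSWORD being exported as above.
psql trips -c 'SELECT COUNT(*) FROM trips;'
```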

The following were the durations I observed:

  • The db.t2.large instance took 60 hours, 38 minutes and 42 seconds, costing $13.98 excluding EC2 costs.
  • The db.r3.large instance took 71 hours, 29 minutes and 27 seconds, costing $25.56 excluding EC2 costs.
  • The db.m4.large instance took 62 hours, 13 minutes and 47 seconds, costing $17.05 excluding EC2 costs.
  • The db.m4.xlarge instance took 51 hours, 43 minutes and 56 seconds, costing $24.98 excluding EC2 costs.
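As a rough cross-check, each figure is close to duration / 24 × the daily rate quoted earlier; the small discrepancies are presumably down to storage being billed monthly rather than per hour:

```shell
# Approximate RDS cost = hours / 24 * daily rate.
# 60h38m42s ≈ 60.645 h on the db.t2.large at $5.50/day;
# 51h43m56s ≈ 51.732 h on the db.m4.xlarge at $11.53/day.
awk 'BEGIN {
    printf "db.t2.large:  $%.2f\n", 60.645 / 24 * 5.50
    printf "db.m4.xlarge: $%.2f\n", 51.732 / 24 * 11.53
}'
# → db.t2.large:  $13.90
# → db.m4.xlarge: $24.85
```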

R Up & Running

The following was run to install R and various other dependencies for the reports Todd Schneider wrote.

$ echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | 
    sudo tee -a /etc/apt/sources.list
$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
$ gpg -a --export E084DAB9 | sudo apt-key add -
$ sudo apt-get update

$ sudo apt-get install \
    git \
    libgdal-dev \
    libpq-dev \
    libproj-dev \
    r-base \
    r-base-dev
$ mkdir -p $HOME/.R_libs
$ export R_LIBS="$HOME/.R_libs"

$ echo 'requirements = c("ggplot2",
                         "ggmap",
                         "dplyr",
                         "reshape2",
                         "zoo",
                         "scales",
                         "extrafont",
                         "grid",
                         "RPostgreSQL",
                         "rgdal",
                         "maptools",
                         "gpclib")

        sapply(requirements,
               function(x) {
                    if (!x %in% installed.packages()[,"Package"])
                        install.packages(x, repos="http://cran.r-project.org")})' | 
  R --no-save

Before running the analysis.R file I needed to patch the database connector to use the environment variables I set earlier rather than the hard-coded localhost setup.

$ cd ~/nyc-taxi-data/analysis/
$ vi helpers.R

The following line in helpers.R:

con = dbConnect(dbDriver("PostgreSQL"), dbname = "nyc-taxi-data", host = "localhost")

was replaced with the following:

con = dbConnect(dbDriver("PostgreSQL"),
                dbname = "trips",
                host = Sys.getenv('PGHOST'),
                user = Sys.getenv('PGUSER'),
                password = Sys.getenv('PGPASSWORD'))

Benchmarking RDS

I ran the analysis.R script three times from the respective EC2 pairs. Each of the results has been rounded to the nearest second.

$ time (cat analysis.R | R --no-save)

The db.t2.large instance reported the following times:

  • Run 1: 5 minutes 21 seconds
  • Run 2: 5 minutes 25 seconds
  • Run 3: 5 minutes 39 seconds

The db.r3.large instance reported the following times:

  • Run 1: 6 minutes
  • Run 2: 5 minutes 30 seconds
  • Run 3: 5 minutes 34 seconds

The db.m4.large instance reported the following times:

  • Run 1: 5 minutes 43 seconds
  • Run 2: 5 minutes 34 seconds
  • Run 3: 5 minutes 33 seconds

The db.m4.xlarge instance reported the following times:

  • Run 1: 5 minutes 34 seconds
  • Run 2: 5 minutes 26 seconds
  • Run 3: 5 minutes 41 seconds

For this workload the reporting speeds don’t line up well with the price differences between the RDS instances. I suspect this workload is biased towards R’s CPU consumption when generating PNGs rather than RDS’ performance when returning aggregate results. All four RDS instances also have the same number of IOPS, which might erase any other performance advantage they could have over one another.

As for the money spent importing the data into RDS I suspect scaling up is more helpful when you have a number of concurrent users rather than a single, large job to execute.

Source: Mark Litwintschik

本站遵循[CC BY-NC-SA 4.0]。如您有版权、意见投诉等问题,请通过eMail联系我们处理。
酷辣虫 » 后端存储 » 1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

喜欢 (0)or分享给?

专业 x 专注 x 聚合 x 分享 CC BY-NC-SA 4.0

使用声明 | 英豪名录