

6/22/17 update

Almost 3,000 people have read the article and I have received a lot of feedback. This is the first update of the article, and I will try to update it further later.

On Tino Tereshko's advice (he is Big Data Lead at the Google Cloud Office of the CTO), I added metrics for BigQuery standard SQL and re-calculated the data loading time (from Google Cloud Storage to BigQuery) following their recent optimizations.
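For context, the two BigQuery dialects differ mainly in how tables are referenced and in which SQL features are available. The project and dataset names below are placeholders, not the ones used in this benchmark:

-- Legacy SQL references tables as [project:dataset.table]
SELECT count(*) FROM [myproject:mydataset.t] WHERE eventnumber > 1

-- Standard SQL references tables as `project.dataset.table`
SELECT count(*) FROM `myproject.mydataset.t` WHERE eventnumber > 1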

With Joe Harris' help (he is a Redshift Database Engineer at AWS), I measured the performance of an optimized schema on both dc1.large and ds2.xlarge Redshift instances. The benchmark below has been updated with these new tests.
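The optimized schema itself is not reproduced here; in Redshift, optimizing a schema usually means choosing a distribution style/key and a sort key that match the query patterns. The sketch below is only an illustration for this table, with assumed column types:

-- Illustrative only: distribute rows by the GROUP BY column and sort by the
-- range-filtered column so scans can skip blocks; column types are guesses.
CREATE TABLE t (
    eventnumber  BIGINT,
    eventFile    VARCHAR(128),
    eventTime    DOUBLE PRECISION,
    multiplicity INTEGER
)
DISTSTYLE KEY
DISTKEY (eventFile)
SORTKEY (eventnumber);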
To calculate load time I initially sent the files to both Amazon S3 and Google Cloud Storage, then loaded them into each datastore. The times below reflect the load time from S3 / Google Cloud Storage to the datastore, not the time it took to transfer the files from a server to the two cloud storage solutions. That distinction may or may not matter in your use case: for the purpose of this test it was easier to load a large dataset from S3 / Google Cloud Storage, but in most cases data will be streamed into the datastore directly from an application.

During my last round of tests with Redshift I split my CSV into fifty 2 GB files instead of using one 1 TB file as shown above. This allowed Redshift to allocate the different files to different nodes and load them in parallel, which dramatically improved performance.
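The splitting helps because a single COPY pointed at a key prefix loads every matching file in parallel, roughly one file per slice. A sketch of such a load, with a placeholder bucket and IAM role:

-- Loads all objects whose keys start with the given prefix, in parallel.
COPY t
FROM 's3://my-bucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
CSV;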

Queries speed

After loading the exact same dataset into each big datastore, I tested the query time of a few sample queries against each one. As expected (and since none of the datastores was MongoDB), the number of rows and the results were consistent across every datastore.

SELECT count(*) FROM t WHERE eventnumber > 1
SELECT count(*) FROM t WHERE eventnumber > 20000
SELECT count(*) FROM t WHERE eventnumber > 500000
SELECT eventFile, count(*) FROM t GROUP BY eventFile
SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime
SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000
SELECT eventFile, AVG(eventTime), AVG(multiplicity) FROM t WHERE eventnumber > 20000 GROUP BY eventFile

The fastest datastore has its query time highlighted in green.
