

6/22/17 update

Almost 3,000 people have read the article and I have received a lot of feedback. This is the first update of the article, and I will try to update it further later.

On Tino Tereshko's advice (he is Big Data Lead at the Google Cloud Office of the CTO), I added metrics for BigQuery standard SQL and re-calculated the data loading time (from Google Cloud Storage to BigQuery) following their recent optimizations.
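For context, the two BigQuery dialects differ mainly in how tables are referenced and in which SQL features are available. The project and dataset names below are placeholders, not the ones used in this benchmark:

-- Legacy SQL references tables as [project:dataset.table]
SELECT count(*) FROM [myproject:mydataset.t] WHERE eventnumber > 1

-- Standard SQL references tables as `project.dataset.table`
SELECT count(*) FROM `myproject.mydataset.t` WHERE eventnumber > 1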

With Joe Harris' help (he is a Redshift Database Engineer at AWS), I measured the performance of an optimized schema on both dc1.large and ds2.xlarge Redshift instances. The benchmark below has been updated with these new tests.
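The optimized schema itself is not reproduced here; in Redshift, optimizing a schema usually means choosing a distribution style/key and a sort key that match the query patterns. The sketch below is only an illustration for this table, with assumed column types:

-- Illustrative only: distribute rows by the GROUP BY column and sort by the
-- range-filtered column so scans can skip blocks; column types are guesses.
CREATE TABLE t (
    eventnumber  BIGINT,
    eventFile    VARCHAR(128),
    eventTime    DOUBLE PRECISION,
    multiplicity INTEGER
)
DISTSTYLE KEY
DISTKEY (eventFile)
SORTKEY (eventnumber);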
To calculate load time I initially sent the files to both Amazon S3 and Google Cloud Storage, then loaded them into each datastore. The times below reflect the load time from S3 / Google Cloud Storage to the datastore, not the time it took to transfer the files from a server to the two cloud storage solutions. That distinction may or may not matter in your use case: for the purpose of this test it was easier to load a large dataset from S3 / Google Cloud Storage, but in most cases data will be streamed into the datastore directly from an application.

During my last round of tests with Redshift I split my CSV into fifty 2 GB files instead of using one 1 TB file as shown above. This allowed Redshift to allocate the different files to different nodes and load them in parallel, which dramatically improved performance.
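The splitting helps because a single COPY pointed at a key prefix loads every matching file in parallel, roughly one file per slice. A sketch of such a load, with a placeholder bucket and IAM role:

-- Loads all objects whose keys start with the given prefix, in parallel.
COPY t
FROM 's3://my-bucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
CSV;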

Queries speed

After loading the exact same dataset into each big datastore, I tested the query time of a few sample queries against each one. As expected (and since none of the datastores was MongoDB), the number of rows and the results were consistent across every datastore.

SELECT count(*) FROM t WHERE eventnumber > 1
SELECT count(*) FROM t WHERE eventnumber > 20000
SELECT count(*) FROM t WHERE eventnumber > 500000
SELECT eventFile, count(*) FROM t GROUP BY eventFile
SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime
SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000
SELECT eventFile, AVG(eventTime), AVG(multiplicity) FROM t WHERE eventnumber > 20000 GROUP BY eventFile

The fastest datastore has its query time highlighted in green.
