My guess is that in the past there was no real use for these formats unless you already had a Hadoop cluster running. If Amazon wants these "Hadoop as a service" concepts to take off, it seems like it would be wise for them to make it easier to get data onto S3 in a better format than CSV.
"You're using it wrong"
"You should start up a Hadoop cluster, create a table, load the CSV into that table, then export it as Parquet, and then load that into S3, so that Athena can scan it"
Wouldn't you honestly be better off just creating a table and loading the CSV directly into a columnar database, say on Amazon RDS (even if it means you bring the server instance down after you're done)?
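For what it's worth, the "export as Parquet and load it into S3" step doesn't actually need a Hadoop cluster these days; something like pyarrow can do it locally in a few lines before the upload. A rough sketch (the file names and bucket are made up):

    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    import boto3

    # Read the CSV and write it back out as Parquet (columnar + compressed).
    table = pv.read_csv("events.csv")
    pq.write_table(table, "events.parquet", compression="snappy")

    # Push the Parquet file to S3 so Athena can scan it instead of the CSV.
    s3 = boto3.client("s3")
    s3.upload_file("events.parquet", "my-bucket", "events/events.parquet")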
And his problem with the field delimiter really shows a lack of experience; of course you can't use a multi-character field delimiter. I've never seen anyone use comma+space as a delimiter. That file could probably shrink to about 75% of its original size if he re-created it with just a comma as the field delimiter.
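If anyone wants to try that, Python's csv module handles it directly, since skipinitialspace drops the space after each comma (file names here are just placeholders):

    import csv

    # Re-write the file with a plain comma delimiter; skipinitialspace drops
    # the stray space that followed each comma in the original.
    with open("data.csv", newline="") as src, \
         open("data_tight.csv", "w", newline="") as dst:
        reader = csv.reader(src, skipinitialspace=True)
        writer = csv.writer(dst)
        writer.writerows(reader)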
Sorry to bash on the author - I don't mean to sound harsh, but a lot of people are putting out benchmarks with minimal context and whatnot.
It is indeed an unconventional format, but that's how the source CSV was formatted on https://sdm.lbl.gov/fastbit/data/samples.html
It would also be helpful to see what schema you used for Redshift, specifically the column encodings and the distribution and sort key(s).
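For anyone unfamiliar, those are the knobs that matter; a purely hypothetical example (column names, encodings, and key choices are guesses, run here through psycopg2 against a placeholder cluster endpoint):

    import psycopg2

    # Hypothetical Redshift DDL: column encodings plus a distribution key
    # and a sort key, which are what actually drive query performance.
    ddl = """
    CREATE TABLE events (
        event_time TIMESTAMP    ENCODE az64,
        user_id    BIGINT       ENCODE az64,
        payload    VARCHAR(256) ENCODE zstd
    )
    DISTKEY (user_id)
    SORTKEY (event_time);
    """

    conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="admin", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)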
To give Athena more of a fighting chance: other people have mentioned Parquet or ORC, but also remember to partition the data. Generally you want to give Athena a directory with the data partitioned into subfolders based on field values. For time-series-type data, for example, you can lay it out as "year=<xxxx>/month=<yy>/day=<zz>/<uuid>.csv". I'm guessing you'd partition this dataset by eventTime, but it depends. Of course, you then need some other component to put the data into S3 in that layout, and you should probably count that as part of the loading time.
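As a rough illustration of combining the two suggestions (Parquet plus Hive-style partitions), pyarrow can write the partitioned layout directly; the eventTime parsing and column names below are assumptions about this dataset:

    import pyarrow as pa
    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    import pyarrow.compute as pc

    # Derive year/month/day from eventTime (assuming an ISO-ish timestamp)
    # so the output lands in year=.../month=.../day=.../ subfolders.
    table = pv.read_csv("events.csv")
    ts = table["eventTime"].cast(pa.timestamp("s"))
    table = table.append_column("year", pc.year(ts))
    table = table.append_column("month", pc.month(ts))
    table = table.append_column("day", pc.day(ts))

    # Write a Hive-partitioned Parquet dataset; sync the folder to S3 after.
    pq.write_to_dataset(table, root_path="events_partitioned",
                        partition_cols=["year", "month", "day"])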
DISCLAIMER: I work for AWS, though not on the Redshift or Athena teams; I do use Redshift for work.
From direct experience working on a moderately sized Redshift cluster, performance is horrific once you start querying tables outside of the sort and distribution keys they were designed with. That's fine when you have a well-defined use case, but for ad-hoc analysis or querying across lots of dimensions we couldn't make Redshift work.
In contrast, I've seen BigQuery perform much more consistently in the face of large datasets with varying workloads.
I would be interested in seeing the settings for the table in Redshift. I don't know much about how BigQuery works, but Redshift relies heavily on distribution and sort keys to efficiently query data.