Anyone Using Apache Spark, Arrow or PySpark for CSV Processing?


Hello All,

I've been really busy this year with a new job and a multitude of other things that have kept me from radio-related activity, and I am probably way out of touch with what's happening on the data front. In any case, I started looking at spot velocity and was astonished to see the growth. Needless to say, my old ways of processing just won't cut it. Given that, I started messing around with the tools I use at work to see what I could muster up.

The Goals Are Pretty Simple

  • Reduce WSPR CSV on-disk storage space
  • Improve CSV read times and portability
  • Improve query performance
  • Produce a cost effective, yet scalable, analytics solution for WSPR data

Some Simple Tests

I wrote up a couple of Python scripts and compared them with some Spark-Scala apps that perform the same tasks. At present, they only perform basic conversions from CSV to Apache Parquet, but the performance increase is impressive.

For perspective, the Scala apps were able to parse the 2020-02 CSV file and create partitioned data in less than 30 seconds. Adding to that, a top-ten query counting spots by reporting station only took 1.7 seconds. I don't know if this is good, bad, or middle of the road compared to what other folks are doing on the WSPR data reporting scene.

On-disk storage was reduced by a ratio of approximately 8 to 1 in some of the tests I ran. For example, the 2020-02 CSV file weighs in at about 870 MB with gzip compression, and 4 GB decompressed. The Parquet output runs right at 632 MB partitioned and 432 MB as a single Brotli-compressed file. I've not done much testing with Avro yet, but that will be next.

In any case, I'd be interested in hearing how folks are processing the CSV files these days, as it's a different game than it was just a year or so ago.

Greg, KI7MT