this past couple of weeks (or so... one day runs into another and I lose all track of time) I've been fighting to hold onto a job that could be big money if I don't get lost in the woods. the client's client is using Pandas to sling data around, but his scripts are all single-threaded and only utilize a single processor. so he wants them ported to Spark. but Spark is buggy -- version 2.0.2, which is what Amazon offers by default on their EMR clusters, has a bug in dataframe.join() which breaks left outer joins (at least) given certain parameters that I haven't tracked down yet, but certainly my code (ported directly from his) triggered it. and 2.0.2 is only about 5 months old. so we're talking really basic stuff not working only 5 months ago -- hardly a mature platform. 2.1.0 fixes that, but there's still a bug in it that makes calculated columns (dataframe.withColumn(...)) disappear on a following join. so you have to write the data out to disk and read it back in before you do the join! that forces the calculations to actually be performed. I'm sure there's an easier way, but that's the way the client's client did it and I'm just learning it as I go.

but stepping back, getting in the cockpit and flying over the problem at ten miles up, I'm looking at all this shit and shaking my head. this isn't programming. I've been saying for a while now that all these frameworks are an impediment to getting work done. people go from storing data in flat files to SQL databases, then on to NoSQL and S3. over a period of several decades, we've gone full circle! instead of using the filesystem to store and reference data, we're now using key-value stores, which accomplish the same thing! even the pseudo-file structure of keys on S3 matches a POSIX file path pattern.

some user on StackOverflow (I think it was abarnert but can't find the thread at the moment) pointed out that simple Unix pipelining makes the best use of all processors and RAM. string together a bunch of scripts that each do one thing and do it well, each feeding its output to the next script. if this client gives me the leeway, I can do that with this problem. for example, all a "left outer join" does is copy all columns from one table to another (appending them to the end) if a certain condition applies (such as, in this case, having the same value in the same key column); rows with no match are kept, with the appended columns left empty. I can script that using just a few lines of Python with the CSV module. for that matter, I could pipeline all the input CSV files through a filter first that changes the separator character to something that is close to 100% certain not to appear in business spreadsheets (such as this?), and just split on that in all downstream pipes, eliminating that module.
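
a minimal sketch of that left outer join as a single pipeline stage, using only the csv module from the standard library. the script name (left_outer_join.py), the function names, and the file and column names in the usage line are all made up for illustration, and it assumes the right-hand table fits in RAM and doesn't share any non-key column names with the left one:

```python
import csv
import sys


def left_outer_join(left_rows, right_rows, key):
    """Yield every left row, extended with the columns of each matching right row.

    Rows are dicts (as produced by csv.DictReader). A left row with no match
    is still emitted, with the right-hand columns left empty.
    """
    right_rows = list(right_rows)
    # columns contributed by the right table, in first-seen order
    extra = []
    for row in right_rows:
        for col in row:
            if col != key and col not in extra:
                extra.append(col)
    # index the right table by key value; one key may match several rows
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    blank = dict.fromkeys(extra, "")
    for left in left_rows:
        for match in index.get(left[key], [blank]):
            joined = dict(left)
            for col in extra:
                joined[col] = match.get(col, "")
            yield joined


def main(right_path, key, infile=sys.stdin, outfile=sys.stdout):
    """Join CSV arriving on infile against the CSV file at right_path."""
    with open(right_path, newline="") as f:
        right_rows = list(csv.DictReader(f))
    writer = None
    for row in left_outer_join(csv.DictReader(infile), right_rows, key):
        if writer is None:
            # header comes from the first joined row: left columns, then right
            writer = csv.DictWriter(outfile, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

it would sit in a pipe something like this (hypothetical file names): python left_outer_join.py customers.csv customer_id < orders.csv > joined.csv. only the right-hand table is held in memory; the left-hand rows stream through one at a time, which is what lets the stages of a pipeline run concurrently.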

one advantage of this pipeline approach is that you can, by various methods, set the process name to something that indicates exactly what the script does, like convert_cp1252_to_utf8 or left_outer_join. so when you start up top, instead of a hundred lines of java or python, you see exactly what's going on!
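
one of those methods, sketched for Linux only: calling prctl(PR_SET_NAME, ...) through ctypes. the helper name set_process_name is made up for this example, and the value 15 for PR_SET_NAME comes from <linux/prctl.h>. on other platforms the function just returns False and does nothing:

```python
import ctypes
import sys

PR_SET_NAME = 15  # option number from <linux/prctl.h>


def set_process_name(name):
    """Best-effort rename of the calling thread (the process, if called from the main thread).

    Linux only. Returns True on success, False otherwise.
    """
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    # the kernel keeps at most 15 bytes of name plus a terminating NUL
    buf = ctypes.create_string_buffer(name.encode("ascii", "replace")[:15])
    return libc.prctl(PR_SET_NAME, buf, 0, 0, 0) == 0


if __name__ == "__main__":
    set_process_name("left_outer_join")
```

the catch is that 15-byte limit: left_outer_join fits exactly, but convert_cp1252_to_utf8 would show up truncated as convert_cp1252_. this changes the short command name that top shows by default, not the full command line. the third-party setproctitle package is another route if installing a dependency is acceptable.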

my inner tinfoil-hat-conspiracy-nut thinks that these frameworks are simply a big scam. I'm guessing they're being pushed by colleges to CS majors, and by big corporations, whose decision-makers are being wined and dined by the companies who stand to make big money training and supporting these bloated pieces of bug-ridden software. they probably hook them at conferences and conventions and such... I don't know, I've been out of the loop for almost 2 decades already.

but what I do know is, most of my coding now isn't coding at all, it's trying to figure out what the fuck the framework is doing to my data, and trying to find a workaround for it. I want to get back to programming.


last updated 2017-01-27 11:14:09. served from tektonic.jcomeau.com