hardcopy

November 08, 2017

As a freshman at Rutgers, I brought my self-built desktop computer with an aging copy of Microsoft Office 2011. It quickly became a pain to only be able to get work done in my room at my desktop, so sophomore year I bought a MacBook to see if I could benefit from a *nix-like system, since I was already comfortable with bash on my Windows machine via Cygwin. I transitioned to doing about 95% of my college reports, slides, etc. in Google Drive.

Here's a tree-like example of my directory structure:

```
~/Documents/school
|-- 2017-Spring-RU
|   |-- Digital Logic Design
|   |   |-- syllabus.gdoc
|   |   |-- report-1.gdoc
...
```

And the setup worked really well! I had some decent filename conventions and some useful symlinks (e.g. `ln -s ~/Documents/school ~/Dropbox/school`) that mirrored my folders across Dropbox and Google Drive for cloud redundancy (maybe I'll add a self-hosted OwnCloud in the future). But at the end of the semester, I realized the problem with doing everything in Google Drive: archiving those files is a huge pain.

Let me explain: when I completed a semester, I cleaned out my school folder by zipping up each course folder and transferring it to my portable SSD. However, Google Drive files are hosted online, and the locally stored *"files"* are just URL pointers to the actual documents. To get the real content, you need to export each file to a common file format. For the one report I worked on last week, that's not too hard to do. For 50+ documents accumulated over an entire semester? yeeeeeaaaaaahhhhh, no.
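For the curious, one of those locally stored pointer *"files"* is just a tiny JSON blob. A .gdoc written by the desktop sync client looks roughly like this (the exact fields vary by client version):

```json
{
  "url": "https://docs.google.com/document/d/<doc-id>/edit",
  "doc_id": "<doc-id>",
  "email": "you@example.com"
}
```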

So I did what any self-respecting programmer would do: I automated it. I wrote hardcopy - a command line utility that traverses a given directory (presumably a Google Drive folder or subfolder) and converts Google Document/Sheet/Slide files to their offline equivalents (.pdf and .xlsx). I built it with Node.js and Google's Drive npm package. I chose Node because I had run into Commander (a nifty npm package for building CLIs), and it made building a CLI so easy that I wanted an excuse to use it.
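To make that concrete, here's a stripped-down sketch of the core loop. It's not hardcopy's actual source - it uses the modern googleapis package and plain process.argv instead of Commander, and the names are mine - but it shows the shape of the thing: walk the folder, read the doc id out of each pointer file, ask the Drive API to export it.

```js
#!/usr/bin/env node
// Sketch of the core idea, not hardcopy's actual source.
const fs = require('fs');
const path = require('path');
const { google } = require('googleapis');

// Offline export target for each Google pointer-file extension.
const EXPORTS = {
  '.gdoc':    { mime: 'application/pdf', ext: '.pdf' },
  '.gslides': { mime: 'application/pdf', ext: '.pdf' },
  '.gsheet':  {
    mime: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    ext: '.xlsx',
  },
};

// Recursively collect every Google pointer file under dir.
function walk(dir, found = []) {
  for (const name of fs.readdirSync(dir)) {
    const p = path.join(dir, name);
    if (fs.statSync(p).isDirectory()) walk(p, found);
    else if (path.extname(p) in EXPORTS) found.push(p);
  }
  return found;
}

// Export one hosted file to its offline equivalent, next to the pointer file.
async function exportFile(drive, file) {
  const { doc_id } = JSON.parse(fs.readFileSync(file, 'utf8'));
  const target = EXPORTS[path.extname(file)];
  const res = await drive.files.export(
    { fileId: doc_id, mimeType: target.mime },
    { responseType: 'stream' }
  );
  // report-1.gdoc becomes report-1.gdoc.pdf, and so on.
  await new Promise((resolve, reject) =>
    res.data
      .pipe(fs.createWriteStream(file + target.ext))
      .on('finish', resolve)
      .on('error', reject)
  );
  console.log(`Found "${file}"`);
}

async function main() {
  // Assumes OAuth client credentials and a saved refresh token in env vars
  // (see the consent-flow sketch further down).
  const auth = new google.auth.OAuth2(
    process.env.CLIENT_ID,
    process.env.CLIENT_SECRET
  );
  auth.setCredentials({ refresh_token: process.env.REFRESH_TOKEN });
  const drive = google.drive({ version: 'v3', auth });

  for (const file of walk(process.argv[2])) {
    await exportFile(drive, file);
  }
}

main().catch(console.error);
```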

There's a bit of acrobatics you need to go through to obtain the right Google Drive API keys, but once that's done, it goes something like this:

```
$ hardcopy ~/"Google Drive"/school/2017-RU-Spring
Found "Digital Logic Design/report-1.gdoc"
...
```

And so on. It's not the prettiest, most user-friendly CLI ever, but it works for me. It exports the files from `<filename>.gdoc` to `<filename>.gdoc.pdf` so I know which files were originally Google Documents.
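As for the API-key acrobatics: it's the standard one-time OAuth consent dance. With the googleapis package it looks roughly like this - my sketch with a read-only Drive scope, not necessarily how hardcopy handles credentials:

```js
// One-time setup: trade a consent code for a refresh token.
const { google } = require('googleapis');

const auth = new google.auth.OAuth2(
  process.env.CLIENT_ID,
  process.env.CLIENT_SECRET,
  'urn:ietf:wg:oauth:2.0:oob' // out-of-band: Google shows you the code in the browser
);

// 1. Open this URL in a browser and approve read-only Drive access.
console.log(auth.generateAuthUrl({
  access_type: 'offline',
  scope: ['https://www.googleapis.com/auth/drive.readonly'],
}));

// 2. Paste the code Google gives back to get tokens; stash the refresh token.
auth.getToken('PASTED_CODE').then(({ tokens }) => console.log(tokens));
```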

My initial version was pretty fast (a totally baseless claim, I didn't actually run any benchmarks lol): it could export about 20 documents in a minute or so.

I was curious to see if I could use multithreading to speed up the export, so I rewrote the CLI in Python - figured it was worth a shot. However, I ran into a byte-streaming issue in the Google Drive library for Python that hadn't been fixed, so I was left with a single-threaded, blocking version of hardcopy. Needless to say, it was a lot slower than the non-blocking Node version.
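For contrast, here's roughly what the Node version gets almost for free - overlapping exports with a small worker pool (again a sketch, reusing the hypothetical `exportFile()` from above; hardcopy's actual batching may differ):

```js
// Run up to `limit` exports at once instead of awaiting them one by one.
async function exportAll(drive, files, limit = 5) {
  const queue = [...files];
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length) {
      await exportFile(drive, queue.shift());
    }
  });
  await Promise.all(workers);
}
```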

So yeah, hardcopy and hardcopy-python are both hosted on my GitHub. Check 'em out; clone, fork, do what you want with them. I do plan to revisit the Python version and see if I can get the multithreading to work - hopefully the recent work turning Google Drive into Backup and Sync fixed the chunk-streaming issue. Probably not, but who knows? :)
