A Python Disaster

I had 30 docx files, each between 300 and 400 pages long (so between 9,000 and 12,000 pages total), with random headings thrown in and also bolded text that is supposed to be a heading but is, of course, like most things in life, completely without style.

My boss asked me to go through and apply heading styles only to the bolded text, and otherwise remove the stylings applied.

So, I could’ve done this manually: open each of the 30 files, find and replace on bold, and apply a style. OR:

[Photo of a python (the animal, not the programming language)]

Python time! (The age of this python is a metaphor for my current abilities with the programming language. Photo by Tigerpython (Own work), CC BY-SA 3.0, via Wikimedia Commons.)

It took longer to make and run the script than it would have taken to do it all manually (mostly because I made little errors).

And here’s the script:

import os
import re

import pypandoc

curDir = os.getcwd()

# Step 1: convert every .docx in the current directory to markdown.
for file in os.listdir(curDir):
    if file.endswith('.docx'):
        fileName = os.path.splitext(file)[0]
        pypandoc.convert_file(file, 'md', outputfile=fileName + "-FA.md")
        # left myself console notes so I knew where I was;
        # like I said, this script took a really long time to run
        print(file + " has been converted to markdown")

# Step 2: now we have our markdown files,
# which are a little easier to parse than docx
for mdFile in os.listdir(curDir):
    if mdFile.endswith('.md'):
        mdFileName = os.path.splitext(mdFile)[0]
        # read the whole file and close it again before rewriting it
        with open(mdFile, encoding='utf-8') as mdFile_opened:
            mdFile_contents = mdFile_opened.read()

        # replace bolded text on its own line
        # with h1's on their own lines;
        # ^ and $ match per line thanks to re.MULTILINE, so two bold
        # lines in a row are both caught
        regex = r"^\*\*(.+?)\*\*$"
        subst = r"# \1"
        mdFile_contents = re.sub(regex, subst, mdFile_contents, flags=re.MULTILINE)

        with open(mdFile, 'w', encoding='utf-8') as mdFile_opened:
            mdFile_opened.write(mdFile_contents)
        print("Headings inside of " + mdFile + " adjusted!")

        pypandoc.convert_file(mdFile, 'docx', outputfile=mdFileName + ".docx")
        print(mdFile + " converted back to DOCX")
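The bold-to-heading step can be tried on its own, without pandoc or any docx files. Below is a minimal sketch using a line-anchored form of the regex; the sample text and the helper name `promote_bold_lines` are made up for illustration and are not taken from the real files.

```python
import re

# Made-up sample of what markdown converted from a docx might look like.
sample = (
    "Intro paragraph.\n"
    "\n"
    "**Chapter One**\n"
    "\n"
    "Some text with **inline bold** that should stay bold.\n"
    "\n"
    "**Chapter Two**\n"
)

# Under re.MULTILINE, ^ and $ match at the start and end of every line,
# so only lines that start and end with ** are rewritten.
BOLD_LINE = re.compile(r"^\*\*(.+?)\*\*$", re.MULTILINE)


def promote_bold_lines(text):
    """Turn lines that are wrapped in **...** into level-1 headings."""
    return BOLD_LINE.sub(r"# \1", text)


converted = promote_bold_lines(sample)
print(converted)
```

Bold text in the middle of a sentence is left alone because the pattern has to start at the beginning of a line. One caveat: a line such as `**a** and **b**` also starts and ends with `**`, so this pattern would treat the whole line as a heading.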

I’m not sure where the script goes wrong, but it does go wrong. It ran for 20 minutes (thankfully giving me updates on its progress, thanks to the helpful print statements I put in there). After the script stalled out on one of the files, I stopped it and went with a manual approach. Find/replacing on bold, applying H1, and then promoting and demoting headings to nest them appropriately was actually faster than I thought. Only took like 10 hours. No big deal. Most of that time was spent wondering whether something was a sub-section of another.
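One way to narrow down which file causes trouble in a batch like this is to announce each file before starting it, time it, and keep going when a conversion raises an error. The following is only a sketch: `convert_batch`, `fake_convert`, and the file names are hypothetical, and `convert` stands in for whatever does the real work (for example, a small wrapper around `pypandoc.convert_file`).

```python
import time


def convert_batch(paths, convert):
    """Run convert(path) on each path; return a dict of path -> exception."""
    failures = {}
    for path in paths:
        # Printed first, so a hang still shows which file was in progress.
        print(f"starting {path}")
        start = time.perf_counter()
        try:
            convert(path)
        except Exception as exc:
            # Record the failure and move on to the next file.
            failures[path] = exc
            print(f"{path} FAILED: {exc!r}")
        else:
            elapsed = time.perf_counter() - start
            print(f"{path} done in {elapsed:.1f}s")
    return failures


# Fake converter that fails on one made-up file name, for demonstration.
def fake_convert(path):
    if path == "chapter-02.docx":
        raise RuntimeError("simulated conversion failure")


failed = convert_batch(
    ["chapter-01.docx", "chapter-02.docx", "chapter-03.docx"], fake_convert
)
```

This does not rescue a conversion that hangs forever, but the "starting" line at least identifies the file that was being processed when things stopped.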

Moral of the story: Make authors and editors determine heading hierarchy.

Painting

22 May 2017 | Painting | Podcasting | Franco A. Alvarado

This one was going to be about maps, but I accidentally painted a painting! Isn’t that fun? So as not to keep you in suspense, here it is:

Resets

15 May 2017 | Meta | Organization | Franco A. Alvarado

I believe I am still adjusting to this new way of working, this new way of living. I had a very productive weekend at the very least. I turned the kitchen island 90 degrees. It separates the space in a more pleasing way. My fiancée had realized that we had just created another galley kitchen by having the kitchen island parallel to the sink / oven / countertop. It’s a strange-looking layout to be honest, but I tested it out today when I cooked three different meals at the same time. I made a chicken tikka masala, eggplant pasta, and pesto. Cooked for the week so I can also get back on my workout schedule.

Overload

08 May 2017 | Meta | Writing | Franco A. Alvarado

I had a lot of freelance work to do this week. It made me think about whether I should quit doing this blog, but then I realized that it’s probably fine. It’s more like a check-in for myself and my projects. Someone might stumble upon this, but it’s not adding anything terribly interesting just yet to the discussion of the different topics I’d like to get into.

Project Rotation

01 May 2017 | Python | Jekyll | Meta | Franco A. Alvarado

I have extracurricular projects I like to work on, in addition to projects at work where I am a project manager and my freelance projects (which I am careful to make sure are never a conflict of interest, in case you were wondering). The projects can include any of the following:

Organizing Metadata in YAML

24 April 2017 | YAML | Metadata | Franco A. Alvarado

I have organized a large amount of content into a YAML document. This will aid in the contract automation I was talking about before. For the encyclopedia project I am working on, there are topics and entries. The editors relay to me the author’s information and what topic they will be writing about, which I am to cross-reference with a set of three documents.

A Python Disaster

17 April 2017 | Python | Pandoc | Regular Expressions | Franco A. Alvarado

I had 30 docx files, each between 300 and 400 pages long (so between 9,000 and 12,000 pages total), with random headings thrown in and also bolded text that is supposed to be a heading but is, of course, like most things in life, completely without style.

Yet another scheduling Jekyll posts post

10 April 2017 | Jekyll | Python | Meta | Franco A. Alvarado

I’ve been trying to research how to schedule these Jekyll blog posts so I am one of those great Internet unicorns that regularly updates their blog. If you don’t know, Jekyll is a static site generator, but it is pretty involved in its setup.

Automating contracts

03 April 2017 | LaTeX | Pandoc | docx | Franco A. Alvarado

I have to juggle multiple authors and their contracts, so I thought it’d be better to automate those contracts so I can prevent errors and have uniformity throughout. Ideal workflow: