TardyParty: Automating Google Groups with Python and PHP

After months of excellent idea generation on Oxidized Bismuth Blogger (a mailing list for short-form musings and wacky business ideas), the OxBiz crew held its first meetup/hike in Redwood Regional Park last week. One idea that came out of the meetup was a weekly summary email announcing who had submitted ideas that week and who had failed to post. The email would, hopefully, shame delinquent list members into posting more frequently.

Down the rabbit hole

Code already exists for checking public RSS feeds and reporting back who posted during the week. The OxBiz code would need to be a little different. The general requirements for the project, which I will lovingly call TardyParty, are:

  • Check a private RSS feed from Google Groups
  • Match names from email posts to the list of OxBiz participants. For example, the script should be able to match “B. Gleitzman” -> “Benjamin Gleitzman”
  • Generate a pre-formatted message and send it to the group

The first task turned out to be harder than expected. Google Groups has no API, but it does provide a private RSS feed of the most recent posts. I used a modified version of the GoogleGroups2Rss project (in PHP, unfortunately) to programmatically log in to Google, capture the required cookies, and ultimately access the private RSS link. I was surprised to discover that Mac OS X comes with PHP and Apache pre-installed.

Once the feed was captured, I cross-referenced the name on each post against the canonical list for the group. I found that difflib’s get_close_matches provided decent enough fuzzy matching.
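
As a rough illustration of that matching step, here is a minimal sketch using difflib.get_close_matches with its default cutoff of 0.6. The match_name helper and the shortened PEOPLE list are illustrative choices of mine, not part of the project, and the snippet is written for Python 3, unlike the Python 2 listing further down.

```python
import difflib

# Shortened, illustrative version of the canonical list (an assumption for this demo)
PEOPLE = ['Chris Maury', 'Benjamin Gleitzman', 'Parker Higgins', 'Rich Jones']


def match_name(raw_name, candidates=PEOPLE, cutoff=0.6):
    """Return the closest canonical name, or None if nothing is similar enough."""
    # n=1 asks for only the single best match at or above the cutoff
    matches = difflib.get_close_matches(raw_name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == '__main__':
    print(match_name('B. Gleitzman'))  # Benjamin Gleitzman
    print(match_name('12345'))         # None
```

get_close_matches ranks candidates by similarity ratio, so raising the cutoff makes matching stricter. Very similar entries (such as two people sharing a surname) can still be confused, so the results are worth a spot check.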

The last step is sending a message via Gmail using SMTP, which is well-documented here.
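
For readers who want a starting point, the sending step might look roughly like the sketch below. It is written for Python 3 with the standard smtplib and email modules. The function names, subject line, and body wording are hypothetical, not taken from the project. It assumes Gmail's SMTP server at smtp.gmail.com on port 587 with STARTTLS. Depending on account settings, Gmail may require an app-specific password.

```python
import smtplib
from email.message import EmailMessage


def build_summary(wrote, missing, sender, recipient):
    """Build the weekly summary email from two lists of names."""
    msg = EmailMessage()
    msg['Subject'] = 'TardyParty weekly summary'  # hypothetical subject line
    msg['From'] = sender
    msg['To'] = recipient
    body = ('People who wrote: %s\n'
            "People who didn't write: %s\n") % (
                ', '.join(wrote) or 'nobody',
                ', '.join(missing) or 'nobody')
    msg.set_content(body)
    return msg


def send_via_gmail(msg, username, password):
    """Send msg through Gmail's SMTP server, upgrading to TLS first."""
    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it easy to print and review the summary before anything actually goes out to the list.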

If you’re interested in reading the code, you can check out the entire project on GitHub. I’ve also listed the meat and potatoes of the project below:

import difflib
import re
import time
import urllib2
import xml.etree.cElementTree as ET

from datetime import datetime, timedelta

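# Timestamp format used by the feed's publication dates
GOOGLE_TIME = '%a, %d %b %Y %H:%M:%S UT'
# Local GoogleGroups2Rss endpoint that logs in and fetches the private feed
XML_FETCHER_URL = 'http://localhost/GoogleGroups2Rss/index.php?group=oxidized-bismuth-blogger'

# Requesting the URL triggers the fetch. The feed is then read from
# /tmp/text.xml, presumably written there by the modified PHP script;
# the response body itself is not used.
urllib2.urlopen(XML_FETCHER_URL).read()
time.sleep(1)  # give the PHP side a moment to finish writing the file
tree = ET.ElementTree(file='/tmp/text.xml')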

people = ['Andrew Van Dam', 'Chris Maury', 'Evan Burchard', 'Benjamin Gleitzman', 'Jam Kotenko', 'Jason Kotenko', 'Kendall Webster', 'Nicole', 'Parker Higgins', 'Rich Jones', 'Zachary Adam Ozer']

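# Report on the previous full week: from last Monday 00:00 up to this Monday 00:00.
# Note that datetime.now() is local time while the feed timestamps are UT.
now = datetime.now()
monday = (now - timedelta(days=now.weekday())).replace(hour=0, minute=0, second=0, microsecond=0)
last_monday = monday - timedelta(days=7)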

people_who_wrote = {}

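# Posts start at the fifth child of the feed's first element
items = tree.getroot()[0][4:]
name_re = re.compile(r"\((.*)\)")  # match name inside parens
for item in items:
    # Fields are read by position: item[0] title, item[1] URL,
    # item[4] poster (name in parentheses), item[5] publication date
    item_pubdate = datetime.strptime(item[5].text.strip(), GOOGLE_TIME)
    if last_monday <= item_pubdate < monday:
        match = name_re.search(item[4].text.strip())
        if match is None:
            continue  # no "(Name)" found for this post; skip it
        name = match.group(1)
        if name not in people_who_wrote:
            # First post under this name: fuzzy-match it against the canonical
            # list and cross that person off the "didn't write" list
            canonical_names = difflib.get_close_matches(name, people)
            if canonical_names:
                canonical_name = canonical_names[0]
                people.remove(canonical_name)

        info_dict = {'title': item[0].text.strip(),
                     'url': item[1].text.strip()}
        people_who_wrote.setdefault(name, []).append(info_dict)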

print 'People who wrote are', ', '.join(people_who_wrote)
print "People who didn't write are", ', '.join(people)
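
The listing above reads feed fields by position (item[0], item[4], and so on), which breaks if the feed layout shifts. Below is a sketch of a tag-based alternative in Python 3. It assumes the feed uses the standard RSS 2.0 element names (item, title, link, author, pubDate), which I have not verified against the Google Groups feed. The helper names and the SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

GOOGLE_TIME = '%a, %d %b %Y %H:%M:%S UT'  # same format string as the listing above

# Made-up sample feed, only for demonstration
SAMPLE = """<rss version="2.0"><channel><title>demo</title>
<item><title>Idea one</title><link>http://example.com/1</link>
<author>b@example.com (B. Gleitzman)</author>
<pubDate>Wed, 04 Jan 2012 10:00:00 UT</pubDate></item>
<item><title>Idea two</title><link>http://example.com/2</link>
<author>r@example.com (Rich Jones)</author>
<pubDate>Wed, 11 Jan 2012 10:00:00 UT</pubDate></item>
</channel></rss>"""


def week_bounds(now):
    """Return (last_monday, monday) at midnight for the week before `now`."""
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return monday - timedelta(days=7), monday


def posts_between(xml_text, start, end):
    """Yield (author, title, link) for items with start <= pubDate < end."""
    root = ET.fromstring(xml_text)
    for item in root.iter('item'):
        # findtext returns the default ('') when a tag is missing
        pubdate = item.findtext('pubDate', '').strip()
        if not pubdate:
            continue
        published = datetime.strptime(pubdate, GOOGLE_TIME)
        if start <= published < end:
            yield (item.findtext('author', '').strip(),
                   item.findtext('title', '').strip(),
                   item.findtext('link', '').strip())


if __name__ == '__main__':
    start, end = week_bounds(datetime(2012, 1, 11, 15, 30))
    for author, title, link in posts_between(SAMPLE, start, end):
        print(author, title, link)
```

Looking fields up by tag name costs a few more characters per field, but a missing or reordered element then yields an empty string instead of silently reading the wrong field.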