bd808.com

Making Django migrations that work with MySQL 5.5 and utf8mb4

2017-04-17T04:18:24+00:00

I like Django, MySQL, and Unicode, but getting all three to play together nicely can sometimes be a bit challenging. One of the more annoying things is getting Django to make a migration that will create a 255 character CharField that is encoded using the utf8mb4 character set and indexed.

Out of the box, MySQL's InnoDB table type has a maximum index length of 767 bytes. This is enough to index 255 characters in the utf8 encoding, but that encoding won't work for storing any Unicode data from the Supplementary Multilingual Plane. That means you can't put a unicorn face (🦄) or a slice of pizza (🍕) into a column using this encoding. Changing to the utf8mb4 character set will allow you to store four byte code points, but then you can only index 191 characters.
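The arithmetic behind those two limits is easy to check. This is a minimal sketch in Python, assuming MySQL reserves the worst-case width for every indexed character (3 bytes for utf8, 4 bytes for utf8mb4); the function name is made up for illustration.

```python
def max_indexed_chars(index_limit_bytes, bytes_per_char):
    """Whole characters that fit in an index of the given byte size."""
    return index_limit_bytes // bytes_per_char


# The index is sized for the widest character the charset can hold.
print(max_indexed_chars(767, 3))  # utf8 → 255
print(max_indexed_chars(767, 4))  # utf8mb4 → 191

# Emoji live outside the Basic Multilingual Plane and need four bytes each.
print(len('🦄'.encode('utf-8')))  # → 4
```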

With MySQL 5.5.14 and later you can raise this limit to 3072 bytes by using the innodb_large_prefix setting along with the Barracuda file format, file per table storage, and dynamic row format. The first three can all be set server wide, but the row format for the table needs to be provided in a CREATE TABLE or ALTER TABLE statement as a ROW_FORMAT=DYNAMIC attribute.
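Before relying on the larger limit it can help to confirm that the three server-wide settings are actually in place. The sketch below assumes you have already read the output of `SHOW VARIABLES LIKE 'innodb_%'` into a plain dict. The helper name and the dict shape are illustrative; only the variable names come from MySQL.

```python
# Server-wide settings named in the text, as MySQL 5.5 reports them.
REQUIRED = {
    'innodb_large_prefix': 'ON',
    'innodb_file_format': 'Barracuda',
    'innodb_file_per_table': 'ON',
}


def missing_prerequisites(server_variables):
    """Return {name: (wanted, actual)} for settings that are not ready."""
    problems = {}
    for name, wanted in REQUIRED.items():
        actual = server_variables.get(name)
        if actual is None or actual.lower() != wanted.lower():
            problems[name] = (wanted, actual)
    return problems


# Values a stock MySQL 5.5 server would typically report.
stock = {
    'innodb_large_prefix': 'OFF',
    'innodb_file_format': 'Antelope',
    'innodb_file_per_table': 'OFF',
}
for name, (wanted, actual) in sorted(missing_prerequisites(stock).items()):
    print('{}: want {}, have {}'.format(name, wanted, actual))
```

With all three in place and a dynamic row format, a 3072 byte index has room for 768 utf8mb4 characters, which is plenty for a 255 character field.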

Django does not have a feature flag or setting for adding the needed attribute. I've worked around this before by using manual database hacking, but today I figured out a hack that you can manually apply to your Django migration files to work around it. The trick is to edit the migration so that the initial field creation uses a length that will fit in the 767 byte limit, and then add a RunSQL to change the table's row format and an AlterField to increase the field length.

 operations = [
     migrations.CreateModel(
         name='UnicodeHack',
         fields=[
             ('hack', models.CharField(unique=True, max_length=128)),
         ],
     ),
     migrations.RunSQL('ALTER TABLE unicodehack ROW_FORMAT = DYNAMIC;'),
     migrations.AlterField(
         model_name='unicodehack',
         name='hack',
         field=models.CharField(unique=True, max_length=255),
     ),
 ]
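One thing the migration above does not provide is a way to reverse the RunSQL step. Django's migrations.RunSQL accepts an optional reverse_sql argument, so a possible refinement is to build both statements together. This is only a sketch: the helper name is hypothetical, and it assumes the table used COMPACT (the MySQL 5.5 default row format) before the change.

```python
def row_format_sql(table, new_format='DYNAMIC', old_format='COMPACT'):
    """Return (forward, reverse) ALTER TABLE statements for a row format.

    old_format='COMPACT' is an assumption about the table's prior state.
    """
    template = 'ALTER TABLE `{}` ROW_FORMAT = {};'
    return template.format(table, new_format), template.format(table, old_format)


forward, reverse = row_format_sql('unicodehack')
print(forward)  # → ALTER TABLE `unicodehack` ROW_FORMAT = DYNAMIC;
print(reverse)  # → ALTER TABLE `unicodehack` ROW_FORMAT = COMPACT;

# In the migration this pair could be used as:
#     migrations.RunSQL(forward, reverse_sql=reverse)
```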

SASL auth with python-irc

2017-03-01T06:48:04+00:00

I maintain a couple of IRC bots that help out with Wikimedia devops tasks. Jouncebot was a bot I started helping with when @mattofak moved on to other projects. Later I developed Stashbot as a replacement for using the Logstash that collected data for my SAL tool in Tool Labs.

Both bots are built using the awesome irc python library from Jason Coombs. I've copied various core irc behaviors from one bot to the other as I've discovered and fixed various bugs in how I was using the library. I finally got around to extracting these core parts into a Python library of its own that I have named "IRC Bot Behavior Bundle" or IB3 for short.

The IB3 library provides a collection of mixin classes that can be used to extend an irc.bot.SingleServerIRCBot instance to do things like:

Encrypt connections using SSL
Authenticate to Freenode
Join channels slowly to avoid flood bans
Ping the upstream IRC server to check for connection liveness
Rejoin channels when kicked
Regain primary nickname after receiving an ERR_NICKNAMEINUSE message

All of these behaviors are pretty battle tested from months/years of use in one or the other of my bots.

IB3 has one sexy new addition, SASL PLAIN authentication. SASL is an IRC v3 protocol extension that allows a client to authenticate at the time of connection. This method lets you authenticate before your connection becomes visible to other clients on the server. It also seems to be a bit faster than the normal exchange with NickServ.

Making a basic bot that uses SASL auth is pretty easy using the library:

# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along with
# this program.  If not, see <http://www.gnu.org/licenses/>.

import ib3.auth
import irc.bot

NICKNAME = 'your account name here'
PASSWORD = 'your password here'
CHANNELS = ['##sasl_test']

class ExampleSaslBot(ib3.auth.SASL, irc.bot.SingleServerIRCBot):
    # Add your ``on_*`` handlers here
    pass

bot = ExampleSaslBot(
    server_list=[('chat.freenode.net', 6667)],
    nickname=NICKNAME,
    realname=NICKNAME,
    ident_password=PASSWORD,
    channels=CHANNELS,
)
bot.start()

The ib3.auth.SASL mixin will take care of these things for you behind the scenes:

Send CAP REQ :sasl as soon as SingleServerIRCBot knows it has connected
Listen for a CAP ACK :sasl response from the server
Send an AUTHENTICATE PLAIN message to start the auth handshake
Wait for an AUTHENTICATE + response
Send AUTHENTICATE <base64 encoded 'username\0username\0password'> SASL PLAIN request
Wait for a 903 :SASL authentication successful response
Send a CAP END message to finish the handshake
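The only fiddly part of that exchange is the base64 payload. Here is a small standalone sketch of how it can be built. It is not the code inside ib3, and the function name and the sample credentials are placeholders.

```python
import base64


def sasl_plain_payload(username, password):
    """Build the base64 argument for the AUTHENTICATE command.

    The PLAIN message is three NUL-separated fields. The account name
    is used for the first two, matching the list above.
    """
    raw = '\0'.join([username, username, password]).encode('utf-8')
    return base64.b64encode(raw).decode('ascii')


print('AUTHENTICATE ' + sasl_plain_payload('jilles', 'sesame'))
# → AUTHENTICATE amlsbGVzAGppbGxlcwBzZXNhbWU=
```

The sketch ignores one detail of the protocol: a payload of 400 bytes or more has to be split across several AUTHENTICATE messages, which should only matter for unusually long credentials.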

Both Jouncebot and Stashbot have been using this code for a few weeks with no problems yet. If you try it out and find issues, please report a bug and I'll see if I can figure out how to make things work better.

Switching from Octopress to Pelican

2017-03-01T04:19:06+00:00

I fell down a rabbit hole a few days ago. I wanted to write a blog post about my new irc library, but the rbenv I had set up to run Octopress was all messed up. I started poking around to try and remind myself how to get it working again and eventually decided that I should really look for a new static site generator written in a language that I like to use. I ended up picking Pelican.

There are plenty of blog posts already that cover the basics, so I won't try to give a complete walk through. I mostly used the guides by Jake Vanderplas and Jinghao Shi along with the manual. I have blogged before about how I set up Octopress to make GitHub issues for comments. I ported this functionality to Pelican with a couple of commits:

bd808/bd808.github.com@412c0b3 adds the javascript and template changes needed to render comments and a link to the GitHub issue for a given post.
bd808/bd808.github.com@fe45c78 adds a new_post target to my fabric file which creates an issue in the GitHub project and adds the needed metadata to a stub Markdown file.
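To give a feel for the second half of that new_post target, here is a rough sketch of generating a stub Markdown file with Pelican metadata. It is not copied from the commit: the function name and the Issue_id key are assumptions for illustration, and creating the issue itself through the GitHub API is left out.

```python
import datetime


def stub_post(title, issue_id, when=None):
    """Return the text of a stub Markdown post with Pelican metadata."""
    when = when or datetime.datetime.now()
    # Replace anything that is not a letter or digit, then join with dashes.
    cleaned = ''.join(c if c.isalnum() else ' ' for c in title.lower())
    slug = '-'.join(cleaned.split())
    lines = [
        'Title: {}'.format(title),
        'Date: {}'.format(when.strftime('%Y-%m-%d %H:%M')),
        'Slug: {}'.format(slug),
        'Issue_id: {}'.format(issue_id),  # hypothetical custom metadata key
        'Status: draft',
        '',
        '',
    ]
    return '\n'.join(lines)


print(stub_post('SASL auth with python-irc', 42,
                datetime.datetime(2017, 3, 1, 6, 48)))
```

A comment template can then read the custom metadata value from the article to find the matching issue.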

Puppet file recurse pitfall

2014-09-30T20:44:13-06:00

Puppet has become my go to system management tool in no small part because it is the tool that the operations group at $DAYJOB has standardized on for our production infrastructure management. It took quite a while for me to get the hang of how Puppet does what it does, but today I'd say I'm a fairly decent Puppet programmer. Every once in a while however I stumble on something new and surprising.

A couple of weeks ago I got an interesting bug report from a user about a collection of Puppet manifests I help manage. The bug was that his testing server was pegged at 99% CPU utilization for multiple minutes during each puppet agent run. The bug reporter did a great job of investigating and had also found that strace showed a repetitive stream of stat() calls while the process was hogging the CPU.

This also turned out to be the great kind of bug that was reproducible. The first testing server I tried the steps from the bug report on showed the exact same symptoms. I grabbed some very verbose logs by turning on the --debug logging in puppet agent and logging all of the system calls with strace at the same time:

$ TZ=UTC strace /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose \
   --no-daemonize --no-splay --debug 2>&1 |
   tee /tmp/loud-puppet-strace.log

Looking at the strace messages there was clearly a pattern of stat() calls for .rb files in unexpected numbers. Puppet was pretty obviously searching for ruby files that were related to several defined types implemented in our manifests. The log was full of lines like stat("/var/lib/puppet/lib/puppet/type/git::clone.rb"). A little searching led me to PUP-2924 which explained that Puppet was checking to see if the type had been implemented as a custom type in Ruby code first before looking for a defined type in the Puppet manifests. In our case, there were 17 possible paths for a Ruby class to be loaded from which led to 17 failed stat calls for each defined type in the manifest.

What this did not explain, however, was why there were so many checks for our git::clone resource. Two million, two hundred ninety three thousand, six hundred and seventy seven calls to stat() for the same collection of files in this one puppet run. Insanity!

$ grep stat\( loud-puppet-strace.log | grep git::clone | wc -l
2293677
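A per-path tally says a bit more than a single total. The following is a small sketch of the same count in Python, broken down by file. The sample lines are invented to look like strace output and are not taken from the real log.

```python
import collections
import re

# Captures the path argument of lines such as:
#   stat("/var/lib/puppet/lib/puppet/type/git::clone.rb", ...) = -1 ENOENT
STAT_CALL = re.compile(r'stat\("([^"]+)"')


def tally_stat_calls(lines, needle):
    """Count stat() calls per path for paths containing ``needle``."""
    counts = collections.Counter()
    for line in lines:
        match = STAT_CALL.search(line)
        if match and needle in match.group(1):
            counts[match.group(1)] += 1
    return counts


sample = [
    'stat("/var/lib/puppet/lib/puppet/type/git::clone.rb", 0x7fff) = -1 ENOENT',
    'stat("/var/lib/puppet/lib/puppet/type/git::clone.rb", 0x7fff) = -1 ENOENT',
    'open("/etc/hosts", O_RDONLY) = 3',
]
for path, count in tally_stat_calls(sample, 'git::clone').most_common():
    print(count, path)
```

Passing an open file object for the strace log instead of the sample list would tally the real run.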

So now I knew what was happening, but I needed to dig deeper to try and figure out why it was happening. For this I wanted even more verbose puppet agent output.

$ TZ=UTC /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose \
     --no-daemonize --no-splay --debug --trace --evaltrace --noop 2>&1 |
     tee /tmp/puppet-noop.log

I watched this run happen in real time and took note of what was logged just before the long pause in logging which accompanied each CPU utilization spike that I now knew correlated to the outrageous number of stat() calls.

Info: Git::Clone[vagrant]: Starting to evaluate the resource
Info: Git::Clone[vagrant]: Evaluated in 0.01 seconds
[... long pause here ...]
Info: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Starting to evaluate the resource
Info: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Evaluated in 0.00 seconds

This led to my ah ha moment and an eventual fix. The File[/srv/vagrant] resource had a definition that looked something like this:

file { '/srv/vagrant':
    recurse => true,
    owner   => 'vagrant',
    group   => 'www-data',
    require => Git::Clone['vagrant'],
}

The intent of this was to recursively manage the ownership of files in the /srv/vagrant directory. Seems pretty simple right? chown -R vagrant:www-data /srv/vagrant would do the same thing at a command prompt.

It turns out however that what Puppet does under the hood is more complicated. The recurse => true flag makes Puppet do the equivalent of a find command on the /srv/vagrant directory and then create a new File resource for each file and directory found that replicates the other settings of the parent type.

file { '/srv/vagrant/file1':
    owner   => 'vagrant',
    group   => 'www-data',
    require => Git::Clone['vagrant'],
}
file { '/srv/vagrant/file2':
    owner   => 'vagrant',
    group   => 'www-data',
    require => Git::Clone['vagrant'],
}
# ... Lots and lots more file resources here ...
file { '/srv/vagrant/subdir/subdir/subdir/fileN':
    owner   => 'vagrant',
    group   => 'www-data',
    require => Git::Clone['vagrant'],
}
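The expansion can be modelled in a few lines of Python. This is a toy model of the behavior described above, not Puppet's actual implementation; the function name and the temporary directory layout are made up for the demonstration.

```python
import os
import tempfile


def expand_recursive(root, attrs):
    """Model recurse => true: one resource per path found under root."""
    resources = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Each child gets its own copy of the parent's settings.
            resources[os.path.join(dirpath, name)] = dict(attrs)
    return resources


attrs = {'owner': 'vagrant', 'group': 'www-data',
         'require': "Git::Clone['vagrant']"}

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'subdir'))
    for relpath in ('file1', 'file2', os.path.join('subdir', 'fileN')):
        open(os.path.join(root, relpath), 'w').close()
    generated = expand_recursive(root, attrs)

print(len(generated))  # 3 files + 1 directory → 4
```

Every entry carries its own copy of the require setting, which is the part that turns out to be expensive.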

All of these resources are added to the internal DAG (Directed Acyclic Graph) and then evaluated one by one. Our /srv/vagrant directory can have a lot of files beneath it. In my testing server there turned out to be about 135,000 files. So Puppet added 135,000 extra nodes to the DAG and as it placed each one it called stat() 17 times to see if there was a Ruby class providing the git::clone resource that Puppet wanted to ensure that the new File resource followed.

YIKES!
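As a back-of-the-envelope check, the two numbers line up: dividing the observed stat() count by the 17 search paths gives roughly one lookup per file under /srv/vagrant.

```python
STAT_CALLS = 2293677  # from the grep | wc -l count above
SEARCH_PATHS = 17     # failed stat() calls per lookup of git::clone

lookups, leftover = divmod(STAT_CALLS, SEARCH_PATHS)
print(lookups, leftover)  # → 134922 3
```

That is about 135,000 lookups, which matches the approximate number of files in the directory.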

I think there are probably several opportunities here for optimizations in the Puppet implementation itself. Caching the implementation of the git::clone resource would be one that comes to mind pretty quickly. Making recursive File resources operate based on one node rather than N would be another. There is probably some kind of graph insertion change that could be made as well. If I was more comfortable with Ruby I might take a stab at one or more of these myself.

To fix the bug at hand however I looked around and found that we really didn't need to bother with the recursive chown at all, so I was able to remove the whole File[/srv/vagrant] resource from the manifest and let our git::clone implementation create the directory when it performed the initial git repository clone operation.

GnuPG key transition statement

2014-05-15T22:33:39-06:00

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1,SHA512

I am transitioning GPG keys from an old 1024-bit DSA key to a new
4096-bit RSA key.  The old key will continue to be valid for some time,
but I prefer all new correspondence to be encrypted to the new key, and
will be making all signatures going forward with the new key.

This transition document is signed with both keys to validate the
transition.

If you have signed my old key, I would appreciate signatures on my new
key as well, provided that your signing policy permits that without
re-authenticating me.

The old key, which I am transitioning away from, is:

  pub   1024D/0x41E5C23F0F8E76D6 [created: 2004-10-14]
    Key fingerprint = FE97 560A 1C17 F268 1A20  5B80 41E5 C23F 0F8E 76D6

The new key, to which I am transitioning, is:

  pub   4096R/0xC139E10FD9F20FC1 [created: 2014-05-16]
    Key fingerprint = 7DFA 4AEF AC15 8BFC 151D  2DD8 C139 E10F D9F2 0FC1

To fetch the full new key from a public key server using GnuPG, run:

  gpg --keyserver keys.gnupg.net --recv-key 0xC139E10FD9F20FC1

If you have already validated my old key, you can then validate that the
new key is signed by my old key:

  gpg --check-sigs 0xC139E10FD9F20FC1

If you then want to sign my new key, a simple and safe way to do that is
by using caff as follows:

  caff 0xC139E10FD9F20FC1

Please contact me via e-mail at <bd808@bd808.com> if you have any
questions about this document or this transition.

Bryan Davis
<bd808@bd808.com>
2014-05-15

-----BEGIN PGP SIGNATURE-----

iEYEARECAAYFAlN1m3cACgkQQeXCPw+OdtYXEwCfXUThM0JsPacy1bCBQ6rZpWRY
dAcAoIMg91zhQlgo2DJCu3o9BUzCqEJuiQIcBAEBCgAGBQJTdZt3AAoJEEhMmO+k
BO60xv0QAJEV8VYVqpIdEoZWRYw6sGJVmkTCs2rC4OC68/W+1e41hgPqE+i+6ACU
2MhwusMQhsBu1QpyeWXTOEPMU4rvwwlMeQnIlg+DEFGn2k3qJfxeYooO1Ni9n0US
fb676RByWnaAZUYPebNrmTvk5bv/M5BSU8XDfPmDsFk5hzeOa1j1kw9Loffr74LL
LJozHb8Uj9fMZj1f8SzqlhyqPVUWqF3AEE3Dl14Wl2FH507ZzpMwuOetj65KxeiJ
Iee2Hhu6TvQcqs6erxMrsVFxuYz9s1eJzo7feEL22Z8Nm46KSF6x43lpt8ebiKDU
zxzdjLBRQOYf3KcCHE2HvbGxPqEfKkwmCJcd1a3Bd/7sgPXrKsJbeCg8LD2x8aTT
DHGXUEVbMv8r3qMAlKXxJ8iBJ9AvdG0nKneVJ8gB6YkCPlSuDlh2bL3CrMPQ5Db+
vtI0EwuGHSMocWX5cns3t31/iWdoOJ/8lvXJoauT+TVmenmhQ0mU71+whVlnahhr
fhKqsZHM4Nryve8LOntndzAIRUK9EZom1ZGfxzEgfgheg0boMfbk9+dS38zVxjmx
EZ4JuTVvAUv4ZgG553JaKed278wNPxdXSqaXggV+HAceFkaW80M6uQhvOCXX+T05
1HCfl3sQmGkYZ1f3DPrcur0jm+PkHPB4Jw29wogBFU0d7dDJQ0qv
=l8Kz
-----END PGP SIGNATURE-----

How do you know when you're done?

2014-01-14T21:48:20-07:00

In scrum a story is "Done" when it meets the team's shared "Definition of Done". The definition of done is roughly a list of requirements that all parts of the software increment must adhere to in order to be called complete. Like most things in scrum the implementation details are left to the team to decide. When I was first working with scrum I had a hard time finding examples of what a typical definition of done would include. Most scrum authors (and even many trainers) wave their hands and say that it's too specific to the team and their environment to generalize.

Intellectually I agree with this, but pragmatically I think that having some sort of rough draft of ideas to start from makes writing the first draft easier. This particular definition of Done is written from the perspective of a cross functional team responsible for implementing features in a product. It does not include Done criteria for the operations or support teams that will maintain the deployed software or assist customers in its use. It does however include deliverables that must be produced by the development team to support those additional teams.

This list taken as a whole looks pretty daunting. It turns out that producing production ready software is hard work. It is such hard work that it takes a group of well trained individuals working as a team to complete properly. This list is a recipe that can and should be used by the team to ensure that they produce an increment that is worthy of their combined energy. When used properly it will increase the reputation and worth of the team, their product and the organization.

Done with grooming a story

A groomed story is clear, feasible and testable.

Business Goal described: Why will we build this?
Acceptance criteria defined: What will it do?
Tasks identified: How will we do it?
Story points estimated: What will it cost?

It may take several iterations to achieve this level of clarity. In fact anything that can be quickly groomed is necessarily trivial. It may still take significant time to implement, but it would have to be a variation on work that has already been done that is understood by the whole team.

Themes, Epics and large stories will need to be decomposed into smaller parts. This must happen recursively until the smallest parts are describable using the criteria established above.

Spikes or other research may need to be done to remove uncertainty about new tech or legacy impact. These things are stories in their own right and should be treated as such. R&D must be a traceable expense and is just as important as the final product/feature.

Done with a story

Everything from "Done with grooming a story": A story must be groomed before it can be implemented.
Design complete: Design is not one size fits all. Some stories must have UML and detailed functional descriptions. Others will only need a statement of "do this just like we always do an X feature." The level of design required should be determined during grooming by the team.
Design artifacts in wiki/bug tracker/other: Design isn't complete until its tangible artifacts are available to the team and the business.
Design reviewed by peers: Similar to a code review, design should get a once over by at least one tangentially involved party to ensure that the level of detail is appropriate to the story and that the proposed implementation makes sense.
Code complete: All code for the story has been written.
Unit tests written: Unit tests have been written to verify that the code works at the most basic level. This can be done via TDD or code-then-test as best suits the team and the story.
All code checked into version control: Feature code and tests are committed to version control.
All unit tests passing: Unit tests are passing in all testable environments.
Automated code checks passing: Coding style, lint and other common automated code quality measurements are passing according to the organization's definition of passing.
CI tests passing: Automated tests in the continuous integration environment are passing.
Peer code review completed: A code review has been completed involving at least one tangentially involved party.
Material defects from code review addressed: All questions and defects raised in the code review have been addressed.
All acceptance tests (manual and automated) identified, written and passing: Given/When/Then style or other detailed acceptance tests for the story have been written and verified either with automated tests or manual testing. Automated tests are preferred as they do not increase the overall manual testing load of the product.
Help/documentation updated: "Just enough" help and documentation has been produced so that the feature can be used by clients, maintained by developers and supported by customer service and operations.
Release notes updated: Deliverable artifacts and deployment procedures have been documented.

Done with a sprint

Everything from "Done with a story": All stories in the sprint must be done (or returned to the backlog) for the sprint to be done.
Released to beta/integration environment: The deliverables identified in the release notes for the Sprint must be deployed in the beta/integration environment.
Demoed in beta/integration environment (UAT): The demonstration of the increment to Product Owner and other Stakeholders must be performed from the beta/integration environment.
Approved by Stakeholders: The increment must be approved following UAT.
CI/automated tests passing: All automated tests against the product must be passing.
Integration tests passing: Manual integration tests for the product must be passing.
Regression tests passing: Manual regression tests for the product must be passing.
Code coverage for automated tests meets acceptable guidelines: Code coverage measurements for unit tests must be within acceptable ranges.
Performance tests passing: Performance/scaling tests must return results within acceptable ranges.
Diagrams/documentation updated to match final state: Documentation for design, implementation, deployment, support and use must be updated to match the completed increment.
Bugs closed or pushed into backlog: Defects identified in UAT, QA and development must be resolved or appended to the backlog for Product Owner triage.
Unfinished stories pushed into backlog: Any work in the sprint which does not meet this definition of done will be returned to the backlog. The Sprint isn't done as long as any non-done issues are associated with it.

Done with a QA/staging release

Everything from "Done with a sprint": All Sprints that are to be included in the release must be Done.
Operations guide updated and approved by Ops: The support documentation delivered to Ops via the wiki must be updated and those updates must be approved (UAT) by the Operations team.
Automated tests passing: All automated tests available for the QA/Staging environment must be passing.

Done with a production release

Everything from "Done with a QA/Staging Release": A successful QA/Staging release is a prerequisite for a Production release.
Stress/Load tests passing: Stress/Load testing in the QA/Staging environment must return results within acceptable ranges.
Network/Component diagrams updated: Documentation for design, implementation, deployment, support and use must be updated to match the proposed release.

FileVault2 Hacks

2013-12-09T21:35:52-07:00

Mac OS X 10.7 introduced a whole disk encryption service called FileVault2. This allows you to use AES 128 encryption to protect your data. This is a great feature but it has a few small drawbacks. First, it uses the password of your primary user account to unlock the system. I'm a fan of strong passwords but for encryption I'd prefer to use a longer pass phrase for increased entropy. Second, the EFI-boot screen that is used to get the password to decrypt the disk shows the display name of all users that can unlock the system rather than blank fields for both username and password. This leaks information that I would really rather not leak. Fortunately I've found a little hack to work around both of these issues.

The key to my fix lies in this statement from the documentation:

Users not enabled for FileVault unlock are only able to log into the computer after an unlock-enabled user has started or unlocked the drive. Once unlocked, the drive remains unlocked and available to all users, until the computer is restarted. "OS X: About FileVault 2"

My fix is to create a new local user account that will only be used to unlock the disk encryption key. This will provide a fix for both issues. Since this account won't be my primary account I can give it a much longer password without risk of RSI every time that OS X prompts me for an administrator password to install or update software. I can also give the user an innocuous display name to be shown on the unlock screen.

Create a new account from the Users & Groups control panel:
New Account: Standard
Full Name: **
Account name: encrypt
Password: omg this is a really long passphrase for me to remember
Follow the instructions for enabling FileVault 2 and choose the new user as the only user who can unlock the disk.

If you already have FileVault 2 enabled you will need to remove the decryption right from the existing users. The easiest way I've found to do this is using the fdesetup command line tool. sudo fdesetup list will show you the accounts that are enabled. sudo fdesetup remove -user bd808 will remove the certificate for the bd808 user.

One last step is to make the new encrypt user log out as soon as they log in. This will return control to the normal OS X login system where you can configure the login screen to display username and password prompts instead of a list of local user accounts. There are probably several ways to do this, but I chose to make a small application that executes this AppleScript command:

logout

ignoring application responses
  tell application "loginwindow" to «event aevtlogo»
end ignoring

Yaml 1.1.1 PECL Module Released

2013-11-18T22:20:43-07:00

I'm glad to announce that I finally got around to releasing the bug fix version of the YAML PECL module that I announced on 2013-04-23. Version 1.1.1 fixes several long standing bugs:

#61770 Crash on nonunicode character
#61923 Detect_scalar_type() is not aware of base 60 representation
#63086 Compiling PHP with YAML as static extension fails
#64019 Segmentation fault if yaml anchor ends with a colon
#64694 Segfault when array used as mapping key

It also includes a small but important patch from a community member who discovered that I had left the yaml_emit_file() method marked as unimplemented when it actually was fully functional.

I hope the users of this extension will find the changes to be useful. I also welcome bug reports, feature requests and patches from the community. I would especially appreciate it if someone found the time to become the maintainer of a Debian package for the project to make it a little easier for some users to install.

Planning Work in a Sprint

2013-10-27T20:05:00+00:00

We've been having some discussions at $DAYJOB about process and methodologies. The topic of late is scrum and how it may or may not be helpful for the particular group I work with. I've been providing some anecdotal input based on my past experience with scrum and other methodologies/frameworks/practices and asking questions about what problems the group is hoping to find new solutions for.

I started to write a big wall o' text™ email about a particular topic and then decided that maybe a blog post would be a better way to work through my idea. So dear reader¹, here are some of my highly opinionated and mostly unsubstantiated thoughts about a process that a group of people could use to plan a scrum sprint (or really any other iterative unit of work).

Pick some work you think you can get done

Step one, pick some work. Sounds easy, but pick it from where? Well that's a damn good question but one that's going to depend on your environment. For the sake of this post let's assume that you have access to an ordered list of features that need to be implemented. Let's further assume that the team and the stakeholders have talked about these features a little and that the team has reached a general consensus about how big the top few features are relative to features the team has worked on in the past. Scrum calls this a "groomed backlog", but you can call it whatever you'd like.

Now that you know where the work comes from, pick some. How much? Well, as much as the team thinks it can get done in the iteration. Without knowing your team and the length of the iteration and how tricky the problems are I can't tell you. Just go with your collective gut and pick some. If you pick too little you can always come back and get more. If you pick too much the team can use that experiential data to adjust when choosing for the next iteration. Just pick some work for now and adjust in the future based on what happens during the iteration².

Figure out what ties the work together

Step two, come up with a narrative about why you chose the work. A list of features and bugs you want to implement is a great start, but you can do better. It will be a lot easier for the team to make good choices during the iteration if they have a more noble goal than "cross off all the things on this list." If the goal is just to get each item done it's more likely that people will think of each part in isolation rather than thinking about how this work builds on what came before and enables more enhancements in the future.

This step may lead you to switch out some of the things you've chosen with other things that are in the backlog to make a more cohesive story. That's ok as long as you keep the most important thing. After all that's the MOST important thing; if you get it done plus some other stuff and everything works people should be happy.

This narrative you've created and the work that supports it are the forecast for the iteration. The product owner can take this information back to the rest of the stakeholders and tell them what to expect to hear about in the demonstration meeting at the end of the iteration. Be careful not to tell them that all of this work will be completed. The team has said they'll try to do this but they can't promise that it will get done any more than the stakeholders can promise how much money will be raised or how many new customers will be acquired.

Figure out how to do the work

The last step before you close the planning meeting and get back to "real work" is to figure out how to actually do the work. We're talking about agile practices here so nobody should expect a Gantt chart or an architecture document, but ~~anarchists~~ agile teams need enough of a plan to do today's work efficiently. The product owner doesn't need to stick around for this half of the meeting. The team should have enough information from the feature descriptions already given to make the tactical plan.

I'm sure there are other methods that would work as well, but I've personally had success with a process that starts with finding dependencies. The team looks at the stories and tries to determine their rough interdependencies. The goal here is to identify communication interfaces that need to be specified and sequential implementation order dependencies. The team also looks for areas of uncertainty that could be resolved with tech spikes and/or further investigation of the requirements.

Once you've got the dependencies sorted, start breaking down the most obvious starting point features. Make a list of the smaller tasks that need to be completed to finish the feature. Repeat the process by breaking those tasks down into even smaller tasks. Stop when the leaf tasks are "small enough". My rule of thumb is that something that feels like it will take 3-5 ideal hours is small enough. Getting smaller than that early on is probably a waste of time, but staying larger leaves more uncertainty and risk in the plan. Scrum calls this step "story decomposition".
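If it helps to see the stopping rule written down, here is a toy sketch of the breakdown as a tree walk. The task names, the hour estimates and the function names are all invented for illustration; the 5 hour threshold is the upper end of the rule of thumb above.

```python
def leaves(task):
    """Return the tasks in the tree that have not been broken down."""
    children = task.get('children')
    if not children:
        return [task]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def needs_more_breakdown(task, small_enough=5):
    """Names of leaf tasks that are still too big to start on."""
    return [leaf['name'] for leaf in leaves(task)
            if leaf['hours'] > small_enough]


story = {'name': 'Password reset', 'children': [
    {'name': 'Reset email', 'children': [
        {'name': 'Email template', 'hours': 3},
        {'name': 'Token generation', 'hours': 4},
    ]},
    {'name': 'Reset form', 'hours': 12},
]}
print(needs_more_breakdown(story))  # → ['Reset form']
```

Anything the second function returns gets another round of decomposition; everything else can go into the iteration backlog.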

The list of decomposed tasks that the team has created is the start of the iteration backlog. Just like the product backlog this needs to be put in order so that when a team member or pair needs more work to do they can just pull the next most important thing in their area of expertise off of the backlog. You'll reorder the list as the iteration progresses, but get started by ordering the tasks you just decomposed.

If you only have a few features to break down, continue to do the work as a group. If there are quite a few to get through you can split up into appropriately skilled groups and work in parallel. Depending on your team and the time you have left in the meeting (two hours per week of iteration is a suggested total duration), you may have time to outline all of the features. You need to at least outline enough to keep the whole team occupied for the rest of today and tomorrow.

If you have some high risk things to accomplish in the iteration try to break them down as early as you can so that someone (or some pair) can start on the tech spikes or API design or whatever sooner rather than later. Don't forget to put a "decompose feature X" task onto the backlog for any stories that you didn't have time to get to by the end of the time box.

Get to work

Now you've got a list of features to implement, a narrative about why these things go together and at least a day or two of granular tasks to start working on. Each team member or pair now needs to select one thing to begin working on. Start by choosing the highest priority task that you have the skill set to accomplish. When you get Done³ with the task you've taken, come back to the backlog and choose another. Don't forget to mark the things you are working on as in progress by whatever tracking mechanism the team is using so you and another team member don't duplicate the work.

Whew. That would have been a nasty email to read. I hope you like it better as a blog post. Don't forget to use inspection and adaptation to refine this process so that it works well for your team. I think I've given a reasonable outline of a process that has worked for me in the past, but never be afraid to look for ways to improve.

Hi Mom! ↩
"Inspect and adapt" is a common refrain in scrum. ↩
"Definition of Done" is a topic for another post. ↩

Creating a Self-signed Code Certificate for XCode

2013-10-21T21:38:00+00:00

I wanted to make my own build of Textual the other day and needed a code signing certificate to complete the build. I decided to make a single, long-lived certificate that I could reuse for building multiple applications.

Open the "Keychain Access" application

open -a "Keychain Access"
Application menu > Certificate Assistant > Create a Certificate...
Configure your new certificate
- Name: Self-signed Applications
- Identity Type: Self Signed Root
- Certificate Type: Code Signing
- [x] Let me override defaults
- Continue
- Change expiration date
- Validity Period (days): 3650
- Continue
Just keep hitting Continue to accept defaults from here on out

Note: Xcode seems to cache certificate info on startup. If you had Xcode open while you created this certificate, restart it.

I have since used this same certificate to build Growl and a couple of other apps. I'm thinking that I'll export the public certificate and import it on my other OSX hosts so I can share the compiled binaries from machine to machine without needing to recompile them.

Managing my laptop with Boxen

2013-10-14T22:11:00+00:00

Boxen is a framework and collection of libraries created by the fine folks at GitHub to make setting up and managing Mac OS X computers easy and repeatable. Rather than a simple set of shell scripts or other provisioning tools, Boxen uses Puppet to automate installing and configuring software. I don't have the time or space to explain how great Puppet is as a configuration management tool, so you'll have to trust me or go do your own research.

Anybody could take a stab at rolling their own collection of Puppet manifests to manage their laptop or their corporate install base. That's actually exactly what GitHub did to create Boxen. Having tried (and failed) to do just that before, I was pretty impressed when I gave Boxen a test drive. GitHub has not only provided a system that "works for them"; they have also managed to engineer a reasonably extensible solution for a very complex problem.

You can use your favorite search engine to find folks who can wax poetic about the magnitude of this accomplishment. Let's get on with a description of what I've been able to do with it.

I'm using Boxen to manage my $DAYJOB laptop. This was a great place to start because I had a brand new laptop that needed to be set up and a brand new tool to use to do it. I started by following the bootstrapping instructions to create my own copy of the template project. I made a few changes to the site manifest and then started working on a manifest for myself.

Along the way I decided I didn't like a few of the decisions that the Boxen architects had made. As I pointed out earlier, the team behind Boxen anticipated this and changing most things is as easy as forking a repo, making your change and updating the Puppetfile in your Boxen project.

At the moment I have customized or created these repositories:

my-boxen: My fork of boxen/our-boxen.
puppet-boxen: Fork of the core boxen/puppet-boxen modules that installs Homebrew in /usr/local instead of under /opt/boxen.
puppet-dnsmasq: Fork of boxen/dnsmasq that uses the stock Homebrew dnsmasq install and provides dnsmasq::address to configure new address mappings.
puppet-geektool: Original module to install GeekTool.
puppet-git: Fork of boxen/puppet-git to use the stock Homebrew version of git.
puppet-growl: Fork of petems/puppet-growl that installs an aging version of Growl. I've since abandoned this in favor of a self-compiled version which I should figure out how to Puppetize.
puppet-homebrew: Fork of nybblr/puppet-homebrew that adds support for installing in /usr/local and using custom Homebrew taps.
puppet-monolingual: Original module to install Monolingual.
puppet-osx: Fork of codec/puppet-osx that pulls in patches from joebadmo/puppet-osx and adds a few system settings of my own.
puppet-slimbatterymonitor: Original module to install SlimBatteryMonitor.

The one thing I most wish someone would figure out how to do with Boxen/Puppet is install apps from the Mac App Store.

Hacking GitHub Contributions Calendar

2013-04-17T21:06:00+00:00

GitHub profile pages include a neat visualization of commit history that they call the "contributions calendar". This 53x7 grid shows the number of commits and other GitHub interactions that the user performed on each day for the last year.

Each cell in the graph is shaded with one of 5 possible colors. These colors correspond to the quartiles of the normal distribution over the range [0, max(v)] where v is the sum of issues opened, pull requests proposed and commits authored per day.

If your all time high for the last year was 100 contributions in a single day, the cells would color like this:

Contributions	Color
0
1 - 24
25 - 49
50 - 74
75+
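
The thresholds in the table can be reproduced with a few lines of Python. This is only a sketch that assumes four equal-width buckets, as the table suggests; the function name `color_level` is made up for illustration and this is not GitHub's actual code.

```python
def color_level(count, day_max):
    """Map a day's contribution count to a shade index from 0 to 4.

    Assumes four equal-width buckets over [1, day_max], matching the
    example table. This is a guess, not GitHub's implementation.
    """
    if count <= 0 or day_max <= 0:
        return 0
    # Integer arithmetic avoids float rounding at the bucket edges.
    return min(4, 1 + (count * 4) // day_max)


if __name__ == "__main__":
    for count in (0, 1, 24, 25, 49, 50, 74, 75, 100):
        print(count, color_level(count, 100))
```

Index 0 is the empty cell and index 4 is the darkest shade.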

A tweet got me interested in the possibility of gaming the interaction data to control the display:

GitHub users might find this guy's contribution graph interesting/funny: github.com/will
— Peter Cooper (@peterc) April 12, 2013

@will has done something to make his calendar spell "WILL" over and over. Looking at his contribution activity list it was pretty obvious that this trick had something to do with the will/githubprofilecheat and/or will/githubprofilecheat2 repositories.

I did some digging in the git documentation to see how hard it is to fake the date on a commit. It turns out that it's as easy as setting an environment variable. The GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables can be used to provide git-commit-tree with dates for the author and commit dates that are attached to each commit object.

Armed with this bit of trivia I decided that I would try to do something interesting with my contributions graph. I didn't just want to copy @will and write my name in the graph. I decided that I would pay homage to my gravatar instead and make a series of gliders that ran across the timeline. The result of my experiment can be seen in the image at the top of this post.

The script that I used to generate the commits with faked dates is available in my bd808/profile-life repository.

The script takes the path to a pattern file and an optional start date as arguments.

./bin/pattern-to-commits.sh patterns/glider.cells 2012-04-15 | sh

The pattern file is expected to be in the plaintext Life format. This format allows you to specify an on/off pattern. When a cell is "on" the script will output 23 commits (one per hour) for the corresponding day. "Off" cells won't generate any commit activity.

!Name: Profile Glider Train
!A simulation of a glider cruising across the contributions timeline.
O.O...O.O....O..................................................................
.OO.O.O..OO...O.O.O...O.O....O..................................................
.O...OO.OO..OOO..OO.O.O..OO...O.O.O...O.O....O..................................
.................O...OO.OO..OOO..OO.O.O..OO...O.O.O...O.O....O..................
.................................O...OO.OO..OOO..OO.O.O..OO...O.O.O...O.O....O..
.................................................O...OO.OO..OOO..OO.O.O..OO...O.
.................................................................O...OO.OO..OOO.

The script reads in this file a column at a time using the cut command. It loops over the characters from the column and when it finds an O it echoes commit commands to stdout:

GIT_AUTHOR_DATE='2013-04-17T20:00' GIT_COMMITTER_DATE='2013-04-17T20:00' \
git commit --allow-empty -m '2013-04-17T20:00'

This output can be piped to bash to apply the commits to the repository.
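
For readers who would rather not read shell, here is a rough Python sketch of the same idea. It is not a port of the real script: the function name `pattern_to_commands` is invented here, and the layout (each column is a week, each row a day of that week, commits stamped on the hour starting at midnight) is my assumption about how the pattern maps onto the calendar.

```python
from datetime import date, datetime, timedelta


def pattern_to_commands(pattern_text, start, commits_per_day=23):
    """Yield one shell command per fake commit for a plaintext Life pattern.

    Assumptions (not taken from the original script): each column is one
    week, each row is one day of that week, and the commits for an "on"
    day are stamped at hours 0 through commits_per_day - 1.
    """
    # Lines starting with "!" are comments in the plaintext Life format.
    rows = [line for line in pattern_text.splitlines()
            if line and not line.startswith("!")]
    if not rows:
        return
    width = max(len(row) for row in rows)
    for col in range(width):
        for row_index, row in enumerate(rows):
            if col >= len(row) or row[col] != "O":
                continue
            day = start + timedelta(days=col * len(rows) + row_index)
            for hour in range(commits_per_day):
                stamp = datetime(day.year, day.month, day.day, hour)
                text = stamp.strftime("%Y-%m-%dT%H:%M")
                yield ("GIT_AUTHOR_DATE='{0}' GIT_COMMITTER_DATE='{0}' "
                       "git commit --allow-empty -m '{0}'".format(text))


if __name__ == "__main__":
    demo = "!Name: demo\nO.\n.O\n"
    for command in pattern_to_commands(demo, date(2012, 4, 15), commits_per_day=1):
        print(command)
```

The generator only builds the command strings; nothing touches a repository until the printed output is fed to a shell.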

An interesting extension of this script would be to support all 5 possible colors. It would also be nice if the script read your current contribution history to determine how many commits are necessary to hit the 4th quartile every time. For now these additions are left as an exercise for the reader. :)

Using GitHub issues for comments

2012-04-14T20:22:00+00:00

I was inspired by Ivan Zuzak's post to try using GitHub issues on the repository for this blog to collect and display reader comments. I'm using Octopress to generate the site, so I decided to make some customizations to make applying Ivan's ideas easy for me.

I started by adding a new configuration setting to my _config.yml file: github_comments: true. I'll use this configuration switch to turn the new feature on in other places in the codebase.

Next I changed the Liquid template in source/_layout/post.html to include a link to the comment thread for the post. I added this block right after the existing disqus rendering block:

source/_layout/post.html

{% if site.github_comments and page.github_issue_id %}
<section id="comments">
  <header>
    <h2>Comments</h2>
    <p>Visit <a href="https://github.com/{{site.github_user}}/{{site.github_user}}.github.com/issues/{{page.github_issue_id}}">this post's issue page on GitHub</a> to add a comment.</p>
  </header>
</section>
{% endif %}

If the github_comments: true flag is set and the YAML front matter for the post contains a github_issue_id: N setting, this block will display a link to issue N in the associated GitHub repository.

Next I wanted to display any current comments. I use a slightly tweaked version of Ivan's JavaScript to do this.

source/_includes/github_comments.html

{% if site.github_comments and page.comments == true %}
<script type="text/javascript">
$.ajax({
    url: "https://api.github.com/repos/{{site.github_user}}/{{site.github_user}}.github.com/issues/{{page.github_issue_id}}/comments"
  , method: "get"
  , headers: { Accept: "application/vnd.github.full+json" }
  , error: function(e){}
  , success: function(resp){
      var cuser, cuserlink, clink, cbody, cavatarlink, cdate;
      for (var i=0; i<resp.length; i++) {
        cuser = resp[i].user.login;
        cuserlink = "https://github.com/" + resp[i].user.login;
        clink = "https://github.com/{{site.github_user}}/{{site.github_user}}.github.com/issues/{{page.github_issue_id}}#issuecomment-" + resp[i].url.substring(resp[i].url.lastIndexOf("/")+1);
        cbody = resp[i].body_html;
        cavatarlink = resp[i].user.avatar_url;
        cdate = (new Date(resp[i].created_at)).toLocaleString();

        $("#comments").append('<div class="comment"><div class="comment-header"><a class="comment-user" href="' + cuserlink + '"><img class="comment-gravatar" src="' + cavatarlink + '" alt="" width="20" height="20"> ' + cuser + '</a><a class="comment-date" href="' + clink + '">' + cdate + '</a></div><div class="comment-body">' + cbody + '</div></div>');
      }
    }
});
</script>
{% endif %}

I added an include for this new file in source/_includes/after_footer.html to get it tacked on to each page:

source/_includes/after_footer.html

{% include github_comments.html %}
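
The least obvious line in that script is the one that builds the permalink for each comment: it takes the last path segment of the API url and appends it to the issue page as an anchor. Here is the same transformation as a small Python sketch. The function name and the sample data are made up for illustration; `user` stands in for site.github_user.

```python
def comment_permalink(comment, user, issue_id):
    """Build the github.com link for one comment returned by the API.

    `comment` is a dict shaped like one entry of the JSON response; only
    its "url" key is used. The comment id is the last path segment.
    """
    comment_id = comment["url"].rsplit("/", 1)[-1]
    return ("https://github.com/{0}/{0}.github.com/issues/{1}"
            "#issuecomment-{2}".format(user, issue_id, comment_id))


if __name__ == "__main__":
    # Hypothetical API entry, trimmed to the one field the function reads.
    sample = {"url": "https://api.github.com/repos/example/"
                     "example.github.com/issues/comments/12345"}
    print(comment_permalink(sample, "example", 7))
```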

Those changes plus the OAuth application configuration described in Ivan's post have the blog all set up for comments. The only problem is that I have to remember to manually create an issue on the GitHub side and add it to the YAML front matter for the post. Being a lazy programmer I wanted to get rid of that burden as well. Luckily for me, Octopress already has a Rake task that sets up a new blog post. The changes I made here aren't pretty, but they are pragmatic.

Rakefile

def create_comment_issue(title, url)
  require 'octopi'
  include Octopi

  authenticated :config => "_github.yml"  do
    user = User.find("bd808")
    repo = user.repository(:name => "bd808.github.com")

    issue = Issue.open :user => user, :repo => repo,
      :params => {
      :title => title,
      :body => "Reader comments on [#{title}](#{url})"
    }
    puts "Successfully opened issue \##{issue.number}"

    labels = issue.add_label "blog-post"

    return issue.number
  end
end

I plugged this function into the existing new_post task so that it will create an issue and plug its id into the front matter for the new post automatically when I run a command like rake new_post["Using GitHub Issues for Comments"]:

source/_posts/2012-04-14-using-github-issues-for-comments.markdown

---
layout: post
title: "Using GitHub issues for comments"
date: 2012-04-14 20:22
comments: true
github_issue_id: 7
categories: 
---

Generating an Apparently Random Unique Sequence

2012-03-31T15:40:00+00:00

Using a sequentially increasing counter to generate an id token is easy. Database sequences and auto-number columns make it fairly trivial to implement. If that isn't available, a simple file or shared memory counter can be implemented in minutes. Displaying such a number to a client, however, may give them more information than you would really like them to have about the number of ids you are allocating per unit time. We'd really like to obfuscate the id somehow while retaining the uniqueness of the original sequence.

One way to do this is to use a combination of multiplication and modulo arithmetic to map the sequence number into a constrained set. With careful choice of the multiplicative constant and the modulo value the resulting number can be made to wander rather effectively over the entire space of the target set.

The basic math looks like this: f(n) := (n * p) % q

n := input sequence value
p := step size
q := maximum result size

p and q must be chosen such that:

p < q
p * q < arithmetic limit (2^31, 2^32, 2^63, 2^64, ... depending on the precision of the underlying system)
p ⊥ q (coprime or relatively prime)

With p := 5 and q := 12 our function will generate this output:

n	1	2	3	4	5	6	7	8	9	10	11
f(n)	5	10	3	8	1	6	11	4	9	2	7

Change p to 7 and you'll get:

n	1	2	3	4	5	6	7	8	9	10	11
f(n)	7	2	9	4	11	6	1	8	3	10	5
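
Both tables can be checked with a short Python sketch. The `check_parameters` helper and its default `limit` of 2^63 are my own additions for illustration; Python integers never overflow, so the limit check only mirrors what a fixed-width implementation would need.

```python
from math import gcd


def check_parameters(p, q, limit=2 ** 63):
    """Raise ValueError unless p and q satisfy the three constraints."""
    if not p < q:
        raise ValueError("p must be smaller than q")
    if not p * q < limit:
        raise ValueError("p * q must stay below the arithmetic limit")
    if gcd(p, q) != 1:
        raise ValueError("p and q must be coprime")


def obfuscate_id(n, p, q):
    """Map sequence value n into [0, q) using step size p."""
    return (n * p) % q


if __name__ == "__main__":
    for p in (5, 7):
        check_parameters(p, 12)
        values = [obfuscate_id(n, p, 12) for n in range(1, 12)]
        # Every input below q maps to a different output.
        assert len(set(values)) == len(values)
        print(p, values)
```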

The rationale for keeping p * q < limit is that as n approaches q the initial multiplication will approach p * q, and if this calculation overflows the available precision the result will wrap back into a previously traversed space, causing duplication. The same sort of thing will occur if p and q are not coprime. The result of the modulo will repeat after only q divided by the GCD¹ of p and q steps rather than mapping the entire range of q evenly.

Careful choice of p and q is key to getting a good spread in the output of the function and maintaining the uniqueness of the result. One easy way to ensure that the chosen coefficients are coprime is to make them both be prime powers of different prime numbers (e.g. 9^17, 13^11, 13^15, 19^7, ...).

This method is a type of Linear congruential generator almost exactly equivalent to the Park–Miller random number generator.

Examples

PHP

<?php
/**
 * Obfuscate an id generated from a linear sequence.
 *
 * @param int $n Input value
 * @param int $p Random walk step size
 * @param int $q Maximum result value
 * @return int Obfuscated result
 */
function obfuscate_id ($n, $p, $q) {
  return ($n * $p) % $q;
}

PL/SQL

FUNCTION obfuscate_id (n NUMBER, p NUMBER, q NUMBER) RETURN NUMBER IS
BEGIN
  RETURN MOD(n * p, q);
END obfuscate_id;

Thanks to Tim for explaining all of this to me several times without becoming annoyed at the parts I wasn't getting.

Greatest Common Divisor ↩

FizzBuzz — the wrong way to do it

2012-01-18T21:47:00+00:00

Write a program that prints the numbers from 1 to 100. But for multiples of three print "Fizz" instead of the number and for the multiples of five print "Buzz". For numbers which are multiples of both three and five print "FizzBuzz". imranontech [at] googlemail.com

Python

#!/usr/bin/env python
for i in range(1, 101):
  print((not i % 3) * "Fizz" + (not i % 5) * "Buzz" or i)

PHP

<?php
$p = "printf"; $r = "str_repeat";
for ($i = 1; $i <= 100; $i++) {
  $p("%s\n", $r($i, $p("%s%s",
      $r("Fizz", !($i % 3)), $r("Buzz", !($i % 5))) == 0));
}

Bash

#!/usr/bin/env bash
for i in {1..100}; do
  if   [ 0 = $(($i % 15)) ]; then echo "FizzBuzz";
  elif [ 0 = $(($i %  3)) ]; then echo "Fizz";
  elif [ 0 = $(($i %  5)) ]; then echo "Buzz";
  else                       echo $i;
  fi
done