Planet Twisted

December 18, 2017

Moshe Zadka

Write Python like an expert

Ten tricks to level up your Python.

Trick 0 -- KISS

Experts know about the weird dark corners of Python -- and know to keep them out of production code. However interesting those corners are, production code is the wrong place for them.

Make your code as straightforward as possible.

Trick 1 -- The power of lists

The humble list, or even the humbler [], packs a lot of punch -- for those who know how to use it.

It serves, of course, as a useful array type. It is also a good stack, using append and pop(), with the correct (amortized) performance characteristics. The .sort() method is sophisticated enough that it is one of the few cases where Python actually broke new theoretical ground on a sorting algorithm -- timsort was originally invented for it.
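
For example, using a list as a stack (a minimal illustration):

>>> stack = []
>>> stack.append('task-1')
>>> stack.append('task-2')
>>> stack.pop()  # amortized O(1)
'task-2'
>>> stack
['task-1']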

Trick 2 -- The power of dicts

The humble dict, or even the humbler {}, also packs a lot of punch.

While many use string keys, it is important to remember that any immutable (hashable) type can serve as a key, including tuples and frozensets. This helps when writing caches and memoizers, or even a passable sparse array.

The keyword argument constructor also gives it a lot of power for making simple and readable APIs.
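
For example, tuple keys give a passable sparse array, and the keyword-argument constructor reads nicely:

>>> sparse = {}
>>> sparse[2, 3] = 1.5                  # tuple key
>>> sparse.get((0, 0), 0.0)
0.0
>>> dict(host='localhost', port=8080)   # keyword constructor
{'host': 'localhost', 'port': 8080}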

Trick 3 -- Iterators and generators

The iterator protocol is one of the most powerful aspects of Python. Experts understand it deeply, and know how to use it to make code shorter, more readable, more composable and more debuggable.

One of the easiest ways to accomplish this is to write functions that accept an iterator and return an iterator -- remembering that generators are really just syntactic sugar for writing functions which return iterators.

If a code base has a lot of functions that return iterators, the iterator-algebra functions in itertools immediately become much more valuable.
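
A minimal sketch of the style ('numbers.txt' here is a stand-in input file): each function accepts an iterable and returns an iterator, and itertools supplies the glue.

import itertools

def parse_ints(lines):
    # A generator: accepts any iterable of strings, yields integers.
    for line in lines:
        yield int(line.strip())

def running_total(numbers):
    # Accepts an iterator, returns an iterator.
    return itertools.accumulate(numbers)

with open('numbers.txt') as fin:
    for total in running_total(parse_ints(fin)):
        print(total)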

Trick 4 -- Collections

The collections module has a lot of wonderful functionality.

For code that needs defaults, defaultdict.

For code that needs counting, Counter.

For FIFOs, deque.
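
Minimal illustrations of all three:

import collections

# defaultdict: group words by their first letter
by_letter = collections.defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    by_letter[word[0]].append(word)

# Counter: count things, then ask for the most common
counts = collections.Counter('abracadabra')
print(counts.most_common(2))   # [('a', 5), ('b', 2)]

# deque: a FIFO with O(1) appends and pops at both ends
fifo = collections.deque()
fifo.append('first')
fifo.append('second')
print(fifo.popleft())          # 'first'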

Trick 5 -- attrs

One thing that is not wonderful about the collections module is the namedtuple class.

In almost every way imaginable, the attrs package is better. Also, for things that wouldn't be namedtuples otherwise, attrs is still better.
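
A small taste of attrs (in its 2017-era attr.s/attr.ib spelling):

import attr

@attr.s
class Point(object):
    x = attr.ib()
    y = attr.ib(default=0)

# __init__, __repr__ and comparison methods all come for free.
print(Point(1, 2))                  # Point(x=1, y=2)
print(Point(1, 2) == Point(1, 2))   # True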

Trick 6 -- First class functions and types

Return functions. Store them in lists, or dictionaries. Keep classes in a double-ended queue. These are not "Python does what?" curiosities. These are ways to avoid boilerplate or needless indirection.
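
For example, a dictionary of functions can replace a chain of if/elif dispatch (a minimal sketch):

def add(x, y):
    return x + y

def mul(x, y):
    return x * y

# A dict of first-class functions stands in for if/elif dispatch.
operations = {'+': add, '*': mul}

def calculate(op, x, y):
    return operations[op](x, y)

print(calculate('+', 2, 3))   # 5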

Trick 7 -- Unit tests and lint

Experts hate having to waste time. Writing unit tests makes sure they have to fix any given bug only once. Correctly configuring a linter makes sure they do not have to comment on every pull request with a list of nitpicks.

Trick 8 -- Immutability

Immutable data structures, such as those available from the Pyrsistent library, are useful for avoiding a lot of bugs. "Global mutable state is the root of all evil" -- and if you cannot get rid of things being global (modules, function defaults and other things), it is often possible to make them immutable instead.

Immutable data structures are much easier to reason about, and make it much harder to introduce bugs that are hard to find and trigger.
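
A quick taste of Pyrsistent's persistent maps: "modifying" one returns a new map and leaves the original untouched.

>>> from pyrsistent import pmap
>>> config = pmap({'retries': 3})
>>> updated = config.set('retries', 5)   # returns a new map
>>> config['retries']                    # the original is unchanged
3
>>> updated['retries']
5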

Trick 9 -- Not reinventing the wheel

If something is available as a wheel, don't reinvent it. PyPI has roughly 125,000 packages at the time of writing. It is almost certain that it has something that takes care of part of the task you are currently working on.

How to know what's worthwhile?

Follow Planet Python, check Awesome Python and, if it is within reach, try to go to Python meetups or conferences. (If it's not, or even if it is, PyVideo has the videos -- but talking to other Python programmers is extremely useful.)

by Moshe Zadka at December 18, 2017 06:00 AM

December 14, 2017

Moshe Zadka

Interesting text encodings (and the people who love them)

(Thanks to Tom Prince and Nelson Elhage for suggestions for improvement.)

Nowadays, almost all text will be encoded in UTF-8 -- for good reasons, it is a well thought out encoding. Some of it will be in Latin 1, AKA ISO-8859-1, which is popular in the western world. Less of it will be in other members of the ISO-8859 family (-2 or higher). Some text from Japan will occasionally still be in Shift-JIS. These encodings are all reasonable -- too reasonable.

What about more interesting encodings?

EBCDIC

Encodings turn a sequence of logical code points into a sequence of bytes. Bytes, in turn, are just sequences of ones and zeroes. Usually, we think of the ones and zeroes as mostly symmetric -- it wouldn't matter if the encoding was to the "dual" byte, where every bit was flipped. SSD drives do not like long sequences of zeroes -- but neither do they like long sequences of ones.

What if there was no symmetry? What if every "one" weakened your byte?

This is the history of one of the most venerable media to carry digital information -- predating the computer through its use in automated weaving machines -- the punched card. It was so called because to make a "one", you would punch a hole, which the card reader detected by an electric circuit being completed. Punching too many holes made cards weak: likely to rip under the wear and tear the automated reading machines inflicted upon them in the drive to read cards ever faster.

EBCDIC (Extended Binary Coded Decimal Interchange Code) was the solution. "Extended" because it extends the Binary Coded Decimal standard -- numbers are encoded using one punch, which makes them easy to read with a hex editor. Letters are encoded with two. Nothing sorts correctly, of course, but that was not a big priority. Quoting from Wikipedia:

"The distinct encoding of 's' and 'S' (using position 2 instead of 1) was maintained from punched cards where it was desirable not to have hole punches too close to each other to ensure the integrity of the physical card.

Of course, it wouldn't be IBM if there weren't a whole host of encodings, subtly incompatible, all called EBCDIC. If you live in the US, you are supposed to use code page 1140 for your EBCDIC needs.

Luckily, if you ever need to connect your Python interpreter to a card-punch machine, the Unicode encodings have got you covered:

>>> "hello".encode('cp1140')
b'\x88\x85\x93\x93\x96'

If you came to this post to learn skills immediately relevant to your day-to-day job and not at all obsolete, you're welcome.

KOI-8

Suppose you're a Russian speaker. You write your language using the Cyrillic alphabet, suspiciously absent from the American Standard Code for Information Interchange (ASCII), developed during the height of the cold war between the US of A and the USSR. Some computers are going to have Cyrillic fonts installed -- and some are not. Suppose that it is the 80s, and the only languages that run fast enough on most computers are assembly and C. You want to make a character encoding that

  • Will look fine if someone has the Cyrillic fonts installed
  • Can be converted to ASCII that will look kinda-sorta like the Cyrillic by a program that is trivial to write in C.

KOI-8 is the result of this not-quite-thought experiment.

The code to convert from KOI-8 to kinda-sorta-look-a-like ASCII, written in Python, would be:

MASK = (1 << 8) - 1
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    while True:
        c = fin.read(1)
        if not c:
            break
        c = c & MASK # <--- this right here
        fout.write(c)

The MASK constant, written in binary, is just 0b1111111 (seven ones). The line with the arrow masks out the "high bit" in the input character.

Sorting KOI-8 by byte value gives you a sort that is not even a little bit right for the alphabet: the letters are all jumbled up. But it does mean that trivial programs in C or assembly -- or sometimes even things that would try to read words out of old MS Word files -- could convert it to something that looks semi-readable on a display that is only configured to display ASCII characters, possibly as a deep hardware limitation.
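
For example, the Cyrillic word МИР ("peace"), encoded with Python's KOI8-R codec and run through the mask, comes out as a recognizable Latin look-alike (note the case flip -- another KOI-8 quirk):

>>> encoded = 'МИР'.encode('koi8_r')
>>> bytes(b & 0x7F for b in encoded).decode('ascii')
'mir'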

Punycode

How lovely it is, of course, to live in 2017 -- the future. We might not have flying cars. We might not even be wearing silver clothing. But by golly, at least our modern encodings make sense.

We send e-mails in UTF-8 to each other, containing wonderful emoji like "eggplant" or "syringe".

Of course, e-mail is old technology -- we send our eggplants, syringes and avocados via end-to-end encrypted Signal chat messages, unreadable by any but our intended recipient.

It is also easy to register our own site, and use an off-the-shelf SaaS offering, such as Wordpress or SquareSpace, to power it. And no matter what we want to put as our domain, we can...as long as it is ASCII-compatible, because DNS is also older than the end of the cold war, and assumes English only.

Seems like this isn't the future after all, which the suspicious lack of flying cars and silver clothing should really have alerted us to.

In our current times, which will be a future generation's benighted past, we must use yet another encoding to put our avocados and eggplants in the names of websites, where they rightly belong.

Enter Punycode, an encoding that is not afraid to ask the hard questions, like "are you sure that the order of encoded bits in the input and the output has to be the same?"

That is, if one string is a prefix of another, should its encoding be a prefix of the other? Just because UTF-8, EBCDIC, KOI-8 or Shift-JIS adhere to this rule doesn't mean we can't think outside the box!

Punycode rearranges the encoding so that all ASCII compatible characters go to the beginning of the string, followed by a hyphen, followed by a complicated algorithm designed to minimize the number of output bytes by assuming the encoded non-ASCII characters are close together.

Consider a simple declaration of love: "I<Red heart emoji>U".

>>> source = b'I\xe2\x9d\xa4U'
>>> declaration = source.decode('utf-8')
>>> declaration.encode('punycode')
b'IU-ony'

Note how, like a well-worn pickup line, I and U were put together, while the part that encodes the heart is at the end.

Consider the slightly more selfish declaration of self-love:

>>> source = b'I\xe2\x9d\xa4me'
>>> source.decode('utf-8').encode('punycode')
b'Ime-4r6a'

Note that even though the selfish declaration and the true love declaration both share a two-character prefix, the result only shares one byte of prefix: the heart got moved to the end -- and not the same heart. Truly, every love is unique.

Punycode's romance with DNS, too, was fraught with drama: indeed, many browsers now will not display Unicode in the address bar, instead showing "xn--<punycode ASCII>" (the "xn--" at the beginning indicates a punycoded string) as a security measure against phishing. It turns out there are a lot of characters in Unicode that look a lot like "a", leading to many interesting variants on "Paypal.com" and "Gmail.com" which look indistinguishable to most humans -- and, as it turns out, most users of the web are indeed of the homo sapiens species.
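
Python's idna codec produces that xn-- form directly; here with the classic "bücher" example:

>>> 'bücher.de'.encode('idna')
b'xn--bcher-kva.de'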

by Moshe Zadka at December 14, 2017 04:00 AM

December 11, 2017

Moshe Zadka

Exploration Driven Development

"It's ok to mess up your own room."

Sometimes there is a problem where the design is obvious -- at least to you. Maybe it's simple. Maybe you've solved one like it many times. In those cases, just go ahead -- use Test-Driven Development, lint your code as you write, and push a branch full of beautiful code, ready to be merged.

This is the exception, rather than the rule, though. In real life, half of the time, we have no idea what we are doing. Is a recursive or iterative solution better? How exactly does this SaaS work? What does Product Management actually want? What is, exactly, the answer to life, the universe and everything?

A lot of the time, solving a problem begins with exploratory programming: writing little snippets, testing them out, writing more snippets, throwing some away when they seem to be going down a bad path, salvaging others from earlier now that we understand the structure better. Poking and prodding at the problem, until the problem's boundaries become clearer.

This is, after all, why dynamic languages became popular -- Python became popular in web development and scientific computing precisely because in both fields, exploratory programming is important.

In those cases, every single rule about "proper software development" goes straight out the window. Massive functions are fine, when you don't know how to break them up. Code with one-letter variables is fine, when you are likely to throw it away. Code with bad formatting is fine, when you are likely to refactor it. Code with no tests is fine, if it's doing the wrong thing anyway. Code with big commented-out sections is fine, if those are likely to prove useful in an hour.

In short, every single rule of "proper software development" goes out the window when we are exploring a problem, testing its boundaries. All but one -- work on a branch, and keep your work backed up. Luckily, all modern version control systems have good branch isolation and easy branch pushing, so there is no problem. No problem, except the social one -- people are embarrassed at writing "bad code". Please don't be. Everyone does it. Just don't merge it into the production code -- that will legitimately annoy other people.

But as long as the mess is in your own room, don't worry about cleaning it up.

by Moshe Zadka at December 11, 2017 04:00 AM

December 07, 2017

Itamar Turner-Trauring

The junior programmer's guide to asking for help at work

When you’re just getting started as a programmer and you need help frequently, asking for help can feel daunting, like you’ll lose either way. Ask too soon and you’ll end up feeling stupid for not having figured out the answer on your own. And if you don’t ask for help, your manager can get annoyed that you’re taking too long to solve the problem.

Asking for help is a skill, and a skill you can learn. Once you’ve mastered this skill you will be able to ask questions at the right time, and in the right way. In this post I’ll cover:

  • Some ways you shouldn’t ask for help.
  • When to ask for help, so that it’s neither too soon nor too late, by planning your task in advance.
  • How to ask for help in a way that will maximize your learning and make you look better to your manager.

The wrong way to ask for help

There are two main failure modes when asking for help: asking too much and asking too little.

“Help help help help help help help help”: You will of course have many questions when you’re learning a new codebase or a new technology. But if you’re asking your lead developer a question every 10 minutes, you’re going to annoy them. A lot. You’re impeding their ability to work, and you’re probably not spending enough time learning on your own.

Instead of asking your questions one by one as they occur, write them all down. Then, when your local expert seems to have a free moment, or if it’s been a few hours since you last asked a question, go and ask them all your questions at once. This will be less intrusive, and chances are you will have figured out some of the answers on your own in the interim.

“I don’t want to ask for help!”: Asking for help can be embarrassing, it’s true. And trying to figure stuff out on your own can help you learn. But if you wait too long, or never ask for help, you’ll both learn less and annoy your manager, because inevitably you’ll end up spinning your wheels and wasting time.

Instead, wait until you’ve given it a reasonable try, and then ask. You’ll learn how to do that in the next section.

Knowing when to ask for help with timeboxing

So how exactly do you know when to ask for help? Advance planning. By knowing how long you have to spend on the task, and then setting a timebox, a limited amount of time to work on it on your own, you can have an alert (metaphorical or real) telling you “it’s been too long, time to ask for help.”

Here’s how the process works:

  1. When your lead developer gives you a task, ask how much time you should spend on it. They might say something like “we need that ready in a couple of days, but really it should only take you a day.” So now you know you want to aim to finish the task within a day. (Over time you’ll learn how to do this yourself, and also whether your manager is overly optimistic or pessimistic about their estimates.)
  2. Now that you know your deadline, set a timebox, a limited amount of time that is less than your deadline. If your deadline is a day, you might set it to three hours.
  3. Now start your task. After you hit your timebox (e.g. three hours), see where you’re at: are you making good progress? Great, set another timebox and keep working. Not making progress? It’s time to ask for help.

If your deadline is one day, and you ask for help after three hours, you’ve not asked too late: there’s still time to finish the task. And you haven’t asked too soon, either, you’ve at least tried on your own.

Learning more (and looking good) when you’re asking for help

You’ve hit your timebox, and you’re asking for help: how do you get the most value out of your questions?

Don’t ask yes/no questions: “is this how I do this?”

If your lead developer actually answers with “yes” or “no”, you’re only gaining 1 bit of information, the smallest amount of information possible. Instead, ask open ended questions: “what should the result be like?” “can you walk me through how this works?”, etc.

Always present a potential answer to the question.

It doesn’t have to be the best answer, or the correct answer (if it were, you probably wouldn’t be asking for help, after all). But you should always say something like “my best guess is this works like this, because of X and Y, but I’m still a little confused - could you explain this?”

Providing an answer serves multiple purposes:

  1. It forces you to try to come up with an answer and learn more. Sometimes you’ll figure it out on your own!
  2. It demonstrates to your manager that you made an effort, making you look good.
  3. It helps your manager understand what you know and what you don’t, which means they’ll have an easier time helping you.

How to ask for help: a recap

Here’s the short version:

  1. Do ask for help.
  2. Batch up your questions.
  3. Set a timebox on tasks, and ask for help if you hit the timebox and you’re still stuck.
  4. Ask open-ended questions, and always provide a potential answer.

Next time your manager gives you a task, apply these guidelines: they’ll be happier, and you’ll learn more.

December 07, 2017 05:00 AM

December 04, 2017

Hynek Schlawack

Python Application Deployment with Native Packages

Speed, reproducibility, easy rollbacks, and predictability are what we strive for when deploying our diverse Python applications. And that’s what we achieved by leveraging virtual environments and Linux system packages.

by Hynek Schlawack (hs@ox.cx) at December 04, 2017 12:00 AM

December 01, 2017

Itamar Turner-Trauring

"What engineers are not getting at their current jobs": an interview with Lynne Tye

How do you find a job with work/life balance? Most companies won’t tell you “we want you to work long hours” on their careers page, it’s hard to ask, and it’s not like you can go to a job board and search for work/life balance. Until now, that is.

Key Values is a newly launched site that lets you filter jobs by values. Instead of the standard boring “at X we’re passionate about doing Y with technology stack Z” (more on that below), you can search by the things that make a job work or not work for you. That might mean work/life balance, but you can also search for companies that are good for junior devs, or have a flat organization. Different people have different values, and Key Values reflects that.

It’s still early days, so there aren’t a huge number of jobs yet, but I love the concept and wanted to hear more. So I got in touch with Lynne Tye, the creator of Key Values, to hear how she ended up creating such a different, and useful, approach to hiring.

Q. Could you share your background with our readers, how you became a programmer?

LYNNE: I studied brain and cognitive sciences at MIT, then went into a PhD program in neuroscience at UCSF. Two years in, I realized it wasn’t for me, and I dropped out. Then I had a couple of odd jobs while I was soul searching, and a few months later I started working at Homejoy, as an operations manager for the Bay Area, a people manager.

While I was working at Homejoy, I noticed how powerful the engineers were. They could make so much impact with just one line of code, and I always felt frustrated when I needed them to fix a small bug that was making my life a nightmare. What I was doing just wasn’t as scalable, like having lots of 1-on-1 meetings. So after Homejoy, I decided I wanted to learn how to code.

Q. What did you learn from your experience before starting Key Values?

LYNNE: Scientific academia is one of the few industries where there’s a master/apprentice relationship, a very clear structure of mentorship. I think that the way you view relationships, the way you make decisions about joining labs is based on the idea of working relationships that need to be as compatible and symbiotic as possible. A lot of times these mentors stay with you your whole life [like family], your mentor’s mentor is your “grandfather”. I noticed this was lacking when I started doing web development.

After grad school, I was feeling pretty lost and really wasn’t sure what I wanted to do with my life. I had basically been laser focused on becoming an academic professor and research scientist for the last 6 years and hadn’t once looked up to consider a different career. One of the main frustrations I had with research was how slow it was, and how slow it was to get feedback.

The environment at Homejoy gave me all of that. It was intense, exciting, fast-paced, and there was constantly feedback from all directions. At the time, it was my dream job. And I think it just also made me realize that it wasn’t for everyone, but it was perfect for me. It made me realize that everyone has their own set of personal values/goals and it’s so important to find work that aligns with those.

Q. Doesn’t sound like there was much work/life balance at Homejoy.

LYNNE: Hahaha, there definitely wasn’t. But, I didn’t want work/life balance! When I left grad school, I genuinely thought work/life balance was a proxy for laziness, or a lack of passion. Of course, after grinding it out at Homejoy for a year and a half, I burned out quite a bit. Afterwards, I wanted work/life balance for the first time. And I found it in the lifestyle I had as a web freelancer.

Ironically, after a couple of years of having so much work/life balance, I started to miss the excitement and sense of urgency of working a lot. That’s where I am now: I’d feel a little sad if I was on a team and the only one working past 6pm or 7pm.

All of this sharpened my views on finding a new career: you need to know what you want, and you should be picky and demanding when you’re evaluating your options.

Q. As an operations manager at Homejoy you did some of the hiring. What did you look for?

LYNNE: It’s funny to look back at it. I didn’t have the language at the time, I didn’t have the framework or language to say what we were. And hiring was so new to me, I didn’t have any experience with it, I hadn’t really articulated the actual values we had.

It was very [intense], we were not shy about it. Everyone worked really late, really early, and on weekends. It felt really exciting, and I don’t think anyone felt like it was work. We all enjoyed spending lots of time together and had all decided we were willing to make that commitment. At the time, I think I was always looking to hire people similar to the existing team.

Q. It seems many companies can’t articulate what they want?

LYNNE: They can’t. Many employers and job seekers have not taken the time to evaluate who they are, where they’re trying to go in terms of culture, and how that impacts hiring. I view my job as helping teams articulate values. And not only helping them articulate them and write them, but also to challenge them, asking whether they’re translating them into actions, or whether they’re things they just write on their website or on their walls.

Q. One of my pet peeves in hiring is the focus on particular technologies. Why do you think hiring managers focus on this so much?

LYNNE: In general I think that job descriptions and the way that people are approaching recruiting and sourcing is outdated, given how much information we all have access to now. Previously it was harder to get information about different employment opportunities, so the biggest differentiators were salary, and do you have experience with hard skills we need today. As time has gone by these things are still important, but people have the ability to compare more teams and have more information to compare them by, and job descriptions haven’t reflected this change.

Software development has changed over the years. I can’t speak from experience, but it’s easier to build things today than it was 20 years ago. The ability to learn technologies isn’t the same conversation it was 20 years ago. I don’t think it makes sense anymore to talk about experience with a particular technology.

Some companies are happy to have people learn on the job, but people just follow the [job posting] template everyone uses:

  • Generic part about what the company does.
  • Generic part about how much you’ll learn, how much fun it is, how much impact you’ll make.
  • Bullet point with requirements, experience with X, Y, Z technology.
  • And then another set of bullet points about benefits and perks, and not-so-compelling reasons to join the company.

Q. Which brings us to Key Values, where you’re trying to do things differently. What exactly does Key Values do?

LYNNE: I try to help job seekers find teams that share their values.

Q. How did you come up with these values?

LYNNE: I interviewed dozens and dozens of engineers. I noticed it’s challenging for people to articulate or identify what they care about most. And I noticed that as people were telling me what they were looking for, it came with a story about a previous experience they had where their job didn’t have that value, and that brought to light why it was so important to them.

After interviewing lots of engineers, I spent time thinking about values, and phrasing them in ways where they would apply to many teams, but not every team. For example, had I had “Mission driven” every single team would have selected it, and it wouldn’t help people differentiate between different teams. And I didn’t want to include values that were specific to one, or even zero teams. It was about striking the balance between those two extremes.

Q. How do you figure out what values the companies have?

LYNNE: Initially, I thought it would be more like research, I wanted to interview every engineer on the team, provide statistics. But I realized it’s not scalable, and I didn’t want to force teams to share information they weren’t comfortable sharing. You’ll never find a team that says “we never eat lunch together, we’re not friends, we’re really not social here” or “we have terrible code quality here.”

By limiting how many values [a team can choose] it tells you what they prioritize. Being limited and being forced to rank [the values they choose] is very informative, it discloses a lot of information implicitly.

Q. On your website you have job listings with these values, and you share with them with world. What can tell you from your data about what engineers care about?

LYNNE: The two things visitors pick most are work/life balance and high quality code base. This is both surprising and not surprising at all. [Next is] “remote ok”, although that is a property, not a value, and I think that makes sense since I still don’t have that many team profiles on Key Values yet. I also think developers are more and more interested in remote opportunities. Close to that are “flexible work arrangements” and “team is diverse”.

To me, these are an indication of what engineers are not getting at their current jobs.

Q. Why is Key Values the only job board that lets you search based on work/life balance?

LYNNE: I don’t think there was previously a way to truthfully tell whether a team really cared. By having a limited list [of values], and priorities, it lets you see who doesn’t prioritize it; otherwise I think most companies wouldn’t volunteer that information. How would you ask? If you poll companies, I can’t imagine any of them wanting to publicly state that they don’t.

At the end of the day, how you define work/life balance has implications, and it’s difficult to categorize these things. For anyone reading about it or talking about it, it’s pretty divisive and polarizing. Some people think if you work more than 40 hours a week you don’t have work/life balance, but I would disagree. My goal is to give companies a chance to tell us how they interpret work/life balance, and expose people to different definitions of that term.

Q. What does a sane workweek mean to you?

LYNNE: A sane workweek to me wouldn’t be a good description; I’d say I’m looking for a sane work month. I love working, I consider myself pretty industrious, but the flexibility to decide when I work is more important. Sometimes I want to work a ridiculous amount one week, and then take a few days off, maybe have a long weekend. And that’s just in terms of when I’m working, and how much.

In general I don’t believe in 40 hours a week, because I don’t operate that way. I don’t have as regular of a schedule, and would 100% rather work 60 hours a week if I could decide when and where I can work, as opposed to a 9-5 at the same physical place with no flexibility. I’d feel much more suffocated with the latter.

In terms of a relationship with an employer, I think the most important thing to me is working someplace where they genuinely support and show interest in other aspects of my life. And that they share some of their priorities in life with me. [That means] having a network of people around you who understand who you are as a whole and support all of you. For me, it means a lot to not just talk about work at work, but to really interact with one another as friends too. I know for sure that this isn’t true for everyone, but I prefer to blur the boundary between professional and personal. I don’t like having complete work/life separation.

OK, back to Itamar here: that was my interview, and now I’d like to ask for your help. Key Values is as far as I know the only place where you can search for jobs with work/life balance, or other values you may care about. That’s hugely valuable, and so I want to see Lynne’s project succeed. If you agree, here’s what you can do:

  • Is your company hiring? Get in touch with Lynne and get your company listed.
  • Are you looking for a job, or plan to look for one in the future? Go visit Key Values and sign up for the newsletter: the more people use the site, the easier it’ll be for Lynne to get more companies on board.

December 01, 2017 05:00 AM

November 20, 2017

Hynek Schlawack

Python Hashes and Equality

Most Python programmers don’t spend a lot of time thinking about how equality and hashing works. It usually just works. However, there are quite a few gotchas and edge cases that can lead to subtle and frustrating bugs once one starts to customize their behavior – especially if the rules governing how they interact aren’t understood.

by Hynek Schlawack (hs@ox.cx) at November 20, 2017 06:45 AM

Itamar Turner-Trauring

Young programmers working long hours: a fun job or bad management?

If you’re looking for a job with work/life balance, you’ll obviously want to avoid a company where everyone works long hours. But what about companies where only some people work long hours? If you ask for an explanation of what’s going on at these companies, one common answer you’ll hear is something like “oh, they’re young, they don’t have families, they enjoy their work so they work longer hours.”

As you venture into the unknown waters of a potential employer, is this something you should worry about? Those younger programmers, you’re told, are having a wonderful time. They’re waving their arms, true, and there’s a tangle of tentacles wrapped around them, but that’s just the company mascot, a convivial squid. Really, they’re having fun.

Or are they?

To be fair, there really are companies where programmers stay later at the office because they enjoy hanging out with people they like, and working on interesting problems while they do it. More experienced programmers are older, and therefore more likely to have children or other responsibilities, and that’s why they’re working shorter hours. On the other hand, it may be that the pile of tentacles wrapped around these programmers is not so much convivial as voracious and hostile: the kraken of overwork.

The kraken of overwork

Another potential reason less experienced programmers are working longer hours is that they don’t know how to get their work done in a reasonable amount of time. Why? Because no one taught them the necessary skills.

I’m not talking about technical skills here, but rather process skills: things like scoping out a task, noticing when you’re stuck, and knowing when to ask for help.

Many managers don’t quite realize these skills exist, or can’t articulate what they are, or don’t know how to teach them. So even if the inexperienced programmers are given reasonable amounts of work, they are never taught the skills to finish their work on time. In this situation having children or a family is not causal, it’s merely correlated with experience. Experienced programmers know how to get their work done in a reasonable amount of time, but inexperienced programmers don’t.

In which case the young programmers’ lack of kids, and “oh, they just enjoy their work”, is just a rationalization: an excuse for skills that aren’t being taught, an excuse for pointless and wasted effort. Those inexperienced programmers waving their hands around aren’t having fun, they’re being eaten by a squid—and no one is helping save them.

Avoiding the kraken

So when you’re interviewing for a job, how do you tell the difference between these two reasons for long hours? Make sure you ask about training and career development.

  • This excellent list of questions to ask during an interview, by Elena Nikoaeleva, suggests asking “how does the company help junior people to grow?”
  • If you’re being interviewed by a manager, ask them for an example of mentoring and teaching they’ve done.
  • If you’re being interviewed by a junior developer, ask them about a time they went off track and took too long to finish a task, to see if they got any help.

Hopefully the answers will demonstrate that the less-experienced programmers are taught the skills they need, and helped when they’re floundering. If not, you may wish to avoid this company, especially if you are less experienced. Good luck interviewing, and watch out for krakens!

November 20, 2017 05:00 AM

November 15, 2017

Moshe Zadka

Abstraction Cascade

(This is an adaptation of part of the talk Kurt Rose and I gave at PyBay 2017)

An abstraction cascade is a common anti-pattern in legacy systems. It is useful to understand how to recognize it, how it tends to come about, how to fix it -- and most importantly, what kinds of things will not fix it. The last one is important for legacy-system anti-patterns in general: if the obvious fix worked, it would already have been dealt with, and it would not be a common anti-pattern.

Recognition

The usual form of an abstraction cascade is a complicated, ad-hoc if/else sequence deciding which path to take. Here is an example of an abstraction cascade for finding the network address corresponding to a name:

def get_address(name):
    if name in services:
        if services[name].ip:
            return services[name].ip, services[name].port
        elif services[name].address:
            # Added for issue #2321
            if ':' in services[name].address:
                return services[name].address.split(':')
            else:
                # Fixes issue #6985
                # TODO: Hotfix, clean-up later
                return services[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

History

At each step, it seems reasonable to make a specific change. Here is a typical way this kind of code comes about.

The initial version is reasonable: since DNS is a way to publish name to address mapping, why not use a standard?

def get_address(name):
    return dns_lookup(name), DEFAULT_PORT

Under load, an outage happened. There was no time to investigate how to configure DNS caching or TTLs better -- so the "popular" services got added to a static list, with a "fast path" check. This decision also makes sense: when an outage is ongoing, the top priority is to relieve the symptoms.

def get_address(name):
    if name in services:
        # Fixes issue #6985
        # TODO: Hotfix, clean-up later
        return services[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

However, now the door has been opened to add another path to the function. When the need to support multiple services on one host arose, it was easier to just add another path: after all, this was only for new services.

def get_address(name):
    if name in services:
        # Added for issue #2321
        if ':' in services[name].address:
            return services[name].address.split(':')
        else:
            # Fixes issue #6985
            # TODO: Hotfix, clean-up later
            return services[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

When the change to IPv6 occurred, splitting on ':' was no longer a safe operation -- so a separate field was added. Again, the existing "new" services (by now many -- and not so new!) did not need to be touched:

def get_address(name):
    if name in services:
        if services[name].ip:
            return services[name].ip, services[name].port
        elif services[name].address:
            # Added for issue #2321
            if ':' in services[name].address:
                return services[name].address.split(':')
            else:
                # Fixes issue #6985
                # TODO: Hotfix, clean-up later
                return services[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

Of course, this is typically just chapter one in the real story: having to adapt to multiple data centers, or multiple providers of services, will lead to more and more of these paths -- with nothing thrown away, because "some legacy service depends on it -- maybe".

Non-fixes

Fancier dispatch

Sometimes the ad-hoc if/else pattern is obscured by more abstract dispatch logic: for example, something that loops through classes and finds the right one:

class AbstractNameFinder(object):
    def matches(self, name):
        raise NotImplementedError()
    def get_address(self, name):
        raise NotImplementedError()

class DNS(AbstractNameFinder):
    def matches(self, name):
        return True
    def get_address(self, name):
        return dns_lookup(name), DEFAULT_PORT

class Local(AbstractNameFinder):
    def matches(self, name):
        return hasattr(services.get(name), 'ip')
    def get_address(self, name):
        return services[name].ip, services[name].port

finders = [Local(), DNS()]

def get_address(name):
    for finder in finders:
        if finder.matches(name):
            return finder.get_address(name)

This is actually worse -- now the problem can be spread over multiple files, with no single place to fix it. While the code can be converted to this form semi-mechanically, the conversion does not fix the underlying issue -- and will actually help the problem continue with renewed force.

Pareto fix

The Pareto rule is that 80% of the problem is solved with 20% of the effort. It is often the case that a big percentage (in the stereotypical Pareto case, 80%) of the problem is not hard to fix.

For example, most services are actually listed in some file, and all we need to do is read this file in and look up based on that. The incentive to fix "80% of the problem" and leave the "20%" for later is strong.

However, usually the problem is that each of those "Pareto fixes" again makes the problem worse: since it is not a complete replacement, another dispatch layer needs to be built to support the "legacy solution". The new dispatch layer, the new solution, and the legacy solution all become part of the newest iteration of the legacy system, and cause the problem to be even worse.

Fixing 80% of the problem is useful for prototyping, when we are not sure we are solving the right problem and nothing better exists yet. Here, however, a complete solution is necessary, so neither of those conditions holds.

Escape strategy

The reason this happens is that no single case can be removed. The way forward is not to add more cases, but to try to remove one. The first question to ask is: why was no case ever removed? Often, the reason is that there is no way to test whether removal is safe.

It might take some work to build infrastructure that makes removal properly safe. Unit tests are often not enough, and sometimes even integration tests are not enough. Sometimes canary systems or feature-flag systems are needed -- or, if worst comes to worst, a way to test in production and roll back quickly if a problem is found.
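
One way to build that confidence, sketched here with hypothetical names (new_get_address, legacy_get_address, a COMPARE_WITH_LEGACY flag, and a log object are all assumptions), is to run the candidate replacement alongside the legacy path and log any disagreement before committing to the removal:

def get_address(name):
    result = new_get_address(name)         # the path we hope to keep
    if COMPARE_WITH_LEGACY:                # hypothetical feature flag
        legacy = legacy_get_address(name)  # the path we hope to remove
        if legacy != result:
            log.warning("address mismatch for %r: %r != %r",
                        name, legacy, result)
    return result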

Once it is possible to remove just one case (in our example above, maybe check what it would take to remove the case where we split on a colon, since this is clearly worse than just having separate attributes), thought needs to be given to which case is best.

Sometimes, there is more than one case that is really needed: some inherent, deep, trade-off. However, it is rare to need more than two, and almost unheard of to need more than three. Start removing unneeded cases one by one.

Conclusion

When seeing an abstraction cascade, there is a temptation to "clean it up": but most obvious clean-ups end up making it worse. However, by understanding how it came to be, and finding a way to remove cases, it is possible to do away with it.

by Moshe Zadka at November 15, 2017 02:23 AM

November 14, 2017

Moshe Zadka

Gather

Gather is a plugin framework -- and it now has its own blog.

Use it! If you like it, tell us about it, and if there is a problem, tell us about that.

by Moshe Zadka at November 14, 2017 02:23 AM

November 07, 2017

Itamar Turner-Trauring

There's no such thing as bad code

Are you worried that you’re writing bad code? Code that doesn’t follow best practices, code without tests, code that violates coding standards, code you simply don’t want to think about because it’s so very very embarrassing?

In fact, there is no such thing as inherently bad code, or for that matter inherently good code. This doesn’t mean you shouldn’t be judging your code, it’s just that if you’re worrying about whether your code is “good” or “bad” then you’re worrying about the wrong thing.

In this post I will:

  • Demonstrate that “bad” code can be “good” code under different circumstances, and that “best practices” aren’t.
  • Suggest a handy mental exercise to help you move away from thinking in terms of “bad vs good”.

“Bad” code, “good” code

Let’s look at a couple of examples of “bad” code and see that, under some circumstances, this “badness” is irrelevant.

Hard-to-read code

As everyone knows, “good” code is easy to read and follow. You need to choose readable variables, clear function names, and so on and so forth. But then again—

  • If you’re writing code for fun, for your own use, who cares? You are the only one who will ever read this code, so you can choose whatever system makes sense to you, so long as you can understand it.
  • If you’re entering the International Obfuscated C Code Contest, you have a completely different set of criteria. There are some truly amazing entries in past contests; calling them “bad code” is just wrong.
  • If the value your code provides to your organization is sufficiently higher than the cost of maintenance, it can’t really be said to be “bad” code. The world is full of spreadsheets and one-off scripts that are hard to read, but nonetheless work correctly and produce huge value.

Unit tests

As everyone knows, “good” code has unit tests, and “bad” code does not. But then again—

  • You’re writing a packaging utility for your product. You have end-to-end tests that make sure the resulting packages install correctly. Unit tests are a waste of time: they prove no additional correctness, and your packaging tool is not your product, it’s just a utility whose code will never be shared elsewhere.
  • You’re building an exploratory prototype. Once you’ve figured out whether this idea works you’ll be throwing the code away and starting from scratch. Why write unit tests?

There’s no such thing as “best practices”

At this point you might be getting a little annoyed. “Yes,” you might say, “these are some exceptions, but for the most part there are standard best practices that everyone can and should follow.” Consider:

  • Formal verification is a practical and useful technique for finding bugs in your code. See for example how AWS uses formal verification to validate their services.
  • NASA’s software development practices are amazing, both effective and expensive. One quote to give you a flavor: “The specs for the [shuttle] program fill 30 volumes and run 40,000 pages.”

Both NASA’s techniques and formal verification lead to far fewer defects. So should they be best practices? It depends: if you’re building the website for a local chain of restaurants, using these techniques is ridiculous. You’ll ship a year late and 10× over budget. On the other hand, if you’re writing the software for a heart monitor, or a spacecraft that’s going to Pluto, “I wrote some unit tests!” is very definitely not best practices.

A sanity check for your code

Instead of feeling embarrassed about your “bad” code, or proud of your “good” code, you should judge your code by how well it succeeds at achieving your goals. Whether you’re competing in the International Obfuscated C Code Contest or working on NASA’s latest mission, you need to use techniques suitable to your particular situation and goal. Judging one’s code can be a little tricky, of course: it’s easy to miss the ways in which you’ve failed, easy to underestimate the ways in which you’ve succeeded.

So try this exercise to give yourself some perspective: every time you write some code figure out the tradeoffs you’re making. That is, identify the circumstances and goals for which your current practices are insufficient, and those for which your current practices are overkill. If you can’t come up with an answer, if your code seems suitable for all situations, that is a danger sign: there’s always a tradeoff, even if you can’t see it.

Finally, when you encounter someone else’s code, be kind: don’t tell them their code is “bad”. Instead, go through the same exercise with them. Figure out their goals, and then walk through the tradeoffs involved in how they’ve written their code. This is a far more useful way of improving their code, and can help you understand why you make the decisions you do.

November 07, 2017 05:00 AM

October 23, 2017

Glyph Lefkowitz

Careful With That PyPI

Too Many Secrets

A wise man once said, “you shouldn’t use ENV variables for secret data”. In large part, he was right, for all the reasons he gives (and you should read them). Filesystem locations are usually a better operating system interface to communicate secrets than environment variables; fewer things can intercept an open() than can read your process’s command-line or calling environment.

One might say that files are “more secure” than environment variables. To his credit, Diogo doesn’t, for good reason: one shouldn’t refer to the superiority of such a mechanism as being “more secure” in general, but rather, as better for a specific reason in some specific circumstance.

Supplying your PyPI password to tools you run on your personal machine is a very different case than providing a cryptographic key to a containerized application in a remote datacenter. In this case, based on the constraints of the software presently available, I believe an environment variable provides better security, if you use it correctly.

Popping A Shell By Any Other Name

If you upload packages to the python package index, and people use those packages, your PyPI password is an extremely high-privilege credential: effectively, it grants a time-delayed arbitrary code execution privilege on all of the systems where anyone might pip install your packages.

Unfortunately, the suggested mechanism to manage this crucial, potentially world-destroying credential is to just stick it in an unencrypted file.

The authors of this documentation know this is a problem; the authors of the tooling know too (and, given that these tools are all open source and we all could have fixed them to be better about this, we should all feel bad).

Leaving the secret lying around on the filesystem is a form of ambient authority; a permission you always have, but only sometimes want. One of the worst things about this is that you can easily forget it’s there if you don’t use these credentials very often.

The keyring is a much better place, but even it can be a slightly scary place to put such a thing, because it’s still easy to put it into a state where some random command could upload a PyPI release without prompting you. PyPI is forever, so we want to measure twice and cut once.

Luckily, even more secure places exist: password managers. If you use https://1password.com or https://www.lastpass.com, both offer command-line interfaces that integrate nicely with PyPI. If you use 1password, you’ll really want https://stedolan.github.io/jq/ (apt-get install jq, brew install jq) to slice & dice its command-line.

The way that I manage my PyPI credentials is that I never put them on my filesystem, or even into my keyring; instead, I leave them in my password manager, and very briefly toss them into the tools that need them via an environment variable.

First, I have the following shell function, to prevent any mistakes:

function twine () {
    echo "Use dev.twine or prod.twine depending on where you want to upload.";
    return 1;
}

For dev.twine, I configure twine to always only talk to my local DevPI instance:

function dev.twine () {
    env TWINE_USERNAME=root \
        TWINE_PASSWORD= \
        TWINE_REPOSITORY_URL=http://127.0.0.1:3141/root/plus/ \
        twine "$@";
}

This way I can debug Twine, my setup.py, and various test-upload things without ever needing real credentials at all.

But, OK. Eventually, I need to actually get the credentials and do the thing. How does that work?

1Password

1password’s command line is a little tricky to log in to (you have to eval its output, it’s not just a command), so here’s a handy shell function that will do it.

function opme () {
    # Log this shell in to 1password.
    if ! env | grep -q OP_SESSION; then
        eval "$(op signin "$(jq -r '.latest_signin' ~/.op/config)")";
    fi;
}

Then, I have this little helper for slicing out a particular field from the OP JSON structure:

function _op_field () {
    jq -r '.details.fields[] | select(.name == "'"${1}"'") | .value';
}

And finally, I use this to grab the item I want (named, memorably enough, “PyPI”) and invoke Twine:

function prod.twine () {
    opme;
    local pypi_item="$(op get item PyPI)";
    env TWINE_USERNAME="$(echo ${pypi_item} | _op_field username)" \
        TWINE_PASSWORD="$(echo "${pypi_item}" | _op_field password)" \
        twine "$@";
}

LastPass

For lastpass, you can just log in (for all shells; it’s a little less secure) via lpass login; if you’ve logged in before you often don’t even have to do that, and it will just prompt you when running a command that requires you to be logged in; so we don’t need the preamble that 1password’s command line did.

Its version of prod.twine looks quite similar, but its plaintext output obviates the need for jq:

function prod.twine () {
    env TWINE_USERNAME="$(lpass show PyPI --username)" \
        TWINE_PASSWORD="$(lpass show PyPI --password)" \
        twine "$@";
}

In Conclusion

“Keep secrets out of your environment” is generally a good idea, and you should always do it when you can. But, better a moment in your process environment than an eternity on your filesystem. Environment-based configuration can be a very useful stopgap for limiting the lifetimes of credentials when your tools don’t support more sophisticated approaches to secret storage.[1]

Post Script

If you are interested in secure secret storage, my micro-project secretly might be of interest. Right now it doesn’t do a whole lot; it’s just a small wrapper around the excellent keyring module and the pinentry / pinentry-mac password prompt tools. secretly presents an interface both for prompting users for their credentials without requiring the command-line or env vars, and for saving them away in keychain, for tools that need to pull in an API key and don’t want to make the user manually edit a config file first.


  [1] Really, PyPI should have API keys that last for some short amount of time, that automatically expire so you don’t have to freak out if you gave somebody a 5-year-old laptop and forgot to wipe it first. But again, if I wanted that so bad, I should have implemented it myself...

by Glyph at October 23, 2017 05:10 AM

Itamar Turner-Trauring

Your technical skills are obsolete: now what?

One day you go to work and discover your technical skills are obsolete:

  • The programming language you know best has been declining in popularity for a decade.
  • The web framework you know best has been completely changed in v2, and rewritten in another language for good measure.
  • Job postings are stomach-churning lists of tools you’ve never used, or even heard of.
  • You’re asked to compare two technologies, and you don’t know where to start.

You feel like your growth has been stunted: there are all these skills you should have been learning, but you never did because you didn’t need them at work. Your coworkers seem to know all about the latest tools and you don’t, and eventually, maybe soon, you’ll just be left behind.

What should you do? How can you get out of this mess and salvage your career?

I’m not going to say “code in your spare time”, because that’s not possible for many people. And while I believe it’s completely possible to keep your skills up-to-date as part of your job (e.g. I’ve written about having a broad grasp of available technology and practicing on the job), the assumption at this point is that you haven’t done so.

Here are your goals, then:

  1. Get your technical skills up to speed, and quickly.
  2. Do it all during work hours.
  3. End up looking good to your manager.

In this post I’ll explain one way to do so, which involves:

  • Understanding the organizational dynamic that causes organizations to use out-of-date technology.
  • Actively seeking out these problematic technologies, and then improving the situation both for your organization and for you.

Why you’re using old technology

Most programmers, probably including you, work on existing software projects. Existing software projects tend to use older technology. The result: you are likely to be using older, out-of-date technology, rather than the latest and (speculatively) greatest.

When a project gets started the programmers in charge pick the best technology they know of, and hope for the best. After that initial set of choices most projects stick to their current technology set, only slowly upgrading over time. Of course, there are always new technologies coming out that claim to be better than existing ones.

Updating an existing project to new technologies is difficult, which means changes are often put off until truly necessary. It takes effort, ranging from a small effort to upgrade to a newer version of a library, to a large effort to change infrastructure like version control, to an enormous effort to switch to a new programming language. So even when a clearly superior and popular technology becomes available, the cost of switching may not be worth it. Time spent switching technologies is time that could be spent shipping features, after all.

Two real-world examples of this dynamic:

  • The Twisted open source project was started in 2000, and used the CVS version control system. Subversion, a superior version control system, was released the same year, and soon new open source projects defaulted to using it. Eventually Twisted switched to Subversion. And then Git was released, and eventually when it was clear Git was winning Twisted switched to Git. In both cases Twisted’s adoption lagged behind new projects, which could easily start off using a better VCS since they didn’t have to pay the cost of upgrading existing infrastructure.
  • A large company, founded in 1997, built the initial version of its software in Perl. At the time Perl was a very popular programming language. Over the past 20 years Perl’s mindshare has shrunk: it’s harder to hire people with Perl knowledge, and fewer 3rd party libraries are being developed. So the company is stuck with a massive working application written in an unpopular language; the cost of switching to a new language is still too high.

The switch to new technology

Eventually the cost of sticking with an old technology becomes too high, and management starts putting resources into upgrading. In a well-run company this happens on a regular, ongoing basis, and management will have spent the resources to keep programmers’ skills up-to-date. In these companies learning about new technologies, and then deciding which are worth the cost of adopting, is an ongoing activity.

In many companies, however, the cost of keeping programmer skills up-to-date is dumped on to you. You are expected to spend your spare time researching new programming languages, tools, and techniques. If you enjoy doing that, great. If you don’t, your technical knowledge will stagnate, at some cost to the company, but even more so to you.

Helping your project, upgrading your skills

If you find yourself in this situation, you can turn your project’s out-of-date technology into a learning opportunity. Technology’s purpose is to solve business problems: you need to identify business problems where your current technology isn’t working well, and try to solve them. This will allow you to research and learn new technologies while helping your project improve.

Specifically, you should:

  1. Identify obsolete and problematic technologies.
  2. Identify potential replacements.
  3. Convince your manager that this is a problem that merits further resources. Your goal is to get the time to build a proof-of-concept or pilot project where you can expand your understanding of a relevant, useful new technology.

If all goes well you’ll have both demonstrated your value to your manager, and been given the time to learn a new technology at work. But even if you fail to convince your manager, you’ll have an easier time when it comes to interviewing at other jobs, and some sense of which technologies are worth learning.

Let’s go through these steps one by one.

1. Identify obsolete and problematic technologies

Your project is likely using many out-of-date technologies: you want to find one that is both expensive to your project, and not too expensive to replace. Since you’re going to have to convince your manager to put some resources into the project, you want to have some clear evidence that an obsolete technology is costing the company.

Look for things like:

  • Technologies that are shrinking in popularity; some Google searches for “$YOURTECH popularity” can help you determine this, as can Google Trends.
  • Repetitive work, where you have to manually do the same task over and over again.
  • Trouble hiring people with pre-existing knowledge.
  • The system everyone knows is broken: it crashes all the time, say, or corrupts data.

You can do this while you work: just look for signs of trouble as you go about your normal business.

2. Identify potential replacements

Once you’ve identified a problem technology, you need to find a plausible replacement. It’s likely that any problem you have is not a rare problem: someone else has had this issue, so someone else has probably come up with a solution. In fact, chances are there are multiple solutions. You just need to find a reasonable one.

You should:

  1. Figure out the keywords that describe this technology. For example, if you need to line up images automatically the term of art is “image registration”. If you want to run a series of long-running processes in a row the terms of art are “workflow management”, “batch processing”, “data pipelines”, and other terms. You can do this by reading the documentation for some known solution, talking to a colleague with broader technical knowledge, or some search engine iteration.
  2. Once you have the keywords, you can start finding solutions. The documentation for a tool will often provide more keywords with which to extend your search, and will often mention competitors.
  3. Search engine queries can also find more alternatives, e.g. search for “$SOMETECH alternatives” or look at the Google search auto-completes for “$SOMETECH vs”.
  4. Once you have found a number of alternatives, get a sense of what the top contender or contenders are. Criteria might include complexity, maturity, risk (is it developed only by a startup?), features, popularity, and so on. At the end of this post you can sign up to get a PDF with my personal process for evaluating technologies.

The goal is to become aware of the range of technologies available, and to get a superficial understanding of their strengths (“React has a bigger community, but skimming the Vue tutorial was easier”, for example). This process can therefore be done over the course of a few hours, at most a day or two, during your work day in-between scheduled tasks.

Remember, you can always rope in a colleague with broader technical knowledge to help out: the goal is to improve things for your project, after all.

3. Getting management buy-in

At this point you should have:

  1. Identified a problem area.
  2. Identified a technology or three that might solve this problem.

Next you need to convince your manager that it’s worth the cost of trying out a new technology. In particular you need to:

  • Demonstrate there’s a real problem.
  • Suggest a potential solution that will solve the problem.
  • Give some evidence this solution is reasonable, and not too expensive.
  • Suggest a next step that will allow you to investigate further (and learn more!). Ideally, some sort of pilot project where you can try the technology out on a small scale and see what it’s like using it in practice.

Your pitch might go something like this:

Demonstrate the problem: “Hey, you know how we have all these problems with Foobar, where we spend all our time redoing the formatting instead of working on actual features?”

Suggest a solution: “I’ve been looking into it and I think the problem is that we’re using a pretty old library; it turns out there’s some newer tools that could make things easier.”

Evidence for solution: “For example the Format.js library, it’s got a huge userbase, and from the docs it’s pretty easy to use, and see this example, that’s exactly what we waste all our time on doing manually!”

Next step: “So, how about for that small project we’re doing next month I try this out instead of our usual Foobar setup, and see how it goes?”

If your manager agrees: success! You now have time to learn a new technology in depth, on the job.

If your manager doesn’t agree, all is not lost. You’ve gained awareness of what newer technologies are available; you might spend a little spare time here and there at work learning it more in depth. And when you next interview for a job, you’ll have some sense of technologies to either brush up on, or at least to mention during the interview: “Formatting? Oh, we used Foobar, which is pretty bad because X, Y and Z. But I did some research and found Format.js and it seemed a lot better because A, B and C. So that’s what I’d use in the future.”

Don’t just be a problem solver

The process I describe above is just one approach; no doubt there are others. The key skill involved, however, can’t be replaced: learning how to identify problems is critical to your success as a programmer.

As a junior programmer you get handed a solution, and then you go off and implement it. When you’re more experienced you get handed problems, and come up with solutions on your own: you become a problem solver. In many ways this is an improvement, both in your skills and in your value as an employee, but it’s also a dangerous place to be.

If you’re a junior programmer no one expects much of you, but once you’re past that point expectations rise. And if you’re only a problem solver, then you’re at the mercy of whoever has the job of identifying problems. If they fail to identify an important problem, like the use of an old technology, or decide your career isn’t a problem worth worrying about, then you might find yourself in trouble: working on a failed project, or lacking the skills you need.

Don’t just be a problem solver: learn how to identify problems on your own. Every day when you go to work, every time you look at some code, every time you see a bug report or a feature request, every time you feel bored, every time someone complains, ask yourself: “What is the problem here?” As you learn to identify problems, you’ll start recognizing obsolete technology. As you learn to identify problems, you’ll start noticing the limits of your own skills and your current career choices. You’ll become a more valuable employee, and you’ll become more effective at achieving your own goals.

October 23, 2017 04:00 AM

Hynek Schlawack

Sharing Your Labor of Love: PyPI Quick and Dirty

A completely incomplete guide to packaging a Python module and sharing it with the world on PyPI.

by Hynek Schlawack (hs@ox.cx) at October 23, 2017 12:00 AM

October 20, 2017

Jonathan Lange

Category theory in everyday life

I was going to write a post about how knowing some abstract algebra can help you write clearer programs.

Then I saw Eugenia Cheng’s excellent talk, Category Theory in Everyday Life, which was a keynote at Haskell Exchange 2017.

It’s excellent. She says what I wanted to say much better than I could, and says many more things that I wouldn’t have thought to say at all. You should watch it.

The talk assumes very little technical or mathematical knowledge, and certainly no knowledge of Haskell.

by Jonathan Lange at October 20, 2017 11:00 PM

October 12, 2017

Jonathan Lange

SPAKE2 in Haskell: How Haskell Helped

Porting SPAKE2 from Python to Haskell helped me understand how SPAKE2 worked, and a large part of that is due to specific features of Haskell.

What’s this again?

As a favour for Jean-Paul, I wrote a Haskell library implementing SPAKE2, so he could go about writing a magic-wormhole client. This turned out to be much more work than I expected. Although there was a perfectly decent Python implementation for me to crib from, my ignorance of cryptography and the lack of standards documentation for SPAKE2 made it difficult for me to be sure I was doing the right thing.

One of the things that made it easier was the target language: Haskell. Here’s how.

Elliptic curves—how do they work?

The arithmetic around elliptic curves can be slow. There’s a trick where you can do the operations in 4D space, rather than 2D space, which somehow makes the operations faster. Brian’s code calls these “extended points”. The 2D points are called “affine points”.

However, there’s a catch. Many of the routines can generate extended points that aren’t on the curve we’re working in, which makes them useless (possibly dangerous) for our cryptography.

The Python code deals with this using runtime checks and documentation. There are many checks of isoncurve, and comments like extended->extended.

Because I have no idea what I’m doing, I wanted to make sure I got this right.

So when I defined ExtendedPoint, I put whether or not the point is on the curve (in the group) into the type.

e.g.

-- | Whether or not an extended point is a member of Ed25519.
data GroupMembership = Unknown | Member

-- | A point that might be a member of Ed25519.
data ExtendedPoint (groupMembership :: GroupMembership)
  = ExtendedPoint
  { x :: !Integer
  , y :: !Integer
  , z :: !Integer
  , t :: !Integer
  } deriving (Show)

This technique is called phantom types.

It means we can write functions with signatures like this:

isExtendedZero :: ExtendedPoint irrelevant -> Bool

Which figures out whether an extended point is zero, and we don’t care whether it’s in the group or not.

Or functions like this:

doubleExtendedPoint
  :: ExtendedPoint preserving
  -> ExtendedPoint preserving

Which says that whether or not the output is in the group is determined entirely by whether the input is in the group.

Or like this:

affineToExtended
  :: AffinePoint
  -> ExtendedPoint 'Unknown

Which means that we know that we don’t know whether a point is on the curve after we’ve projected it from affine to extended.

And we can very carefully define functions that decide whether an extended point is in the group or not, which have signatures that look like this:

ensureInGroup
  :: ExtendedPoint 'Unknown
  -> Either Error (ExtendedPoint 'Member)

This pushes our documentation and runtime checks into the type system. It means the compiler will tell me when I accidentally pass an extended point that’s not a member (or not proven to be a member) to something that assumes it is a member.

When you don’t know what you are doing, this is hugely helpful. It can feel a bit like a small child trying to push a star-shaped thing through the square-shaped hole. The types are the holes that guide how you insert code and values.

What do we actually need?

Python famously uses “duck typing”. If you have a function that uses a value, then any value that has the right methods and attributes will work, probably.

This is very useful, but it can mean that when you are trying to figure out whether your value can be used, you have to resort to experimentation.

inbound_elem = g.bytes_to_element(self.inbound_message)
if inbound_elem.to_bytes() == self.outbound_message:
    raise ReflectionThwarted
pw_unblinding = self.my_unblinding().scalarmult(-self.pw_scalar)
K_elem = inbound_elem.add(pw_unblinding).scalarmult(self.xy_scalar)

Here, g is a group. What does it need to support? What kinds of things are its elements? How are they related?

Here’s what the type signature for the corresponding Haskell function looks like:

generateKeyMaterial
  :: AbelianGroup group
  => Spake2Exchange group  -- ^ An initiated SPAKE2 exchange
  -> Element group  -- ^ The outbound message from the other side (i.e. inbound to us)
  -> Element group -- ^ The final piece of key material to generate the session key.

This makes it explicit that we need something that implements AbelianGroup, which is an interface with defined methods.

If we start to rely on something more, the compiler will tell us. This allows for clear boundaries.

When reverse engineering the Python code, it was never exactly clear whether a function in a group implementation was meant to be public or private.

By having interfaces (type classes) enforced by the compiler, this is much more clear.

What comes first?

The Python SPAKE2 code has a bunch of assertions to make sure that one method isn’t called before another.

In particular, you really shouldn’t generate the key until you’ve generated your message and received one from the other side.

Using Haskell, I could put this into the type system, and get the compiler to take care of it for me.

We have a function that initiates the exchange, startSpake2:

-- | Initiate the SPAKE2 exchange. Generates a secret (@xy@) that will be held
-- by this side, and transmitted to the other side in "blinded" form.
startSpake2
  :: (AbelianGroup group, MonadRandom randomly)
  => Spake2 group
  -> randomly (Spake2Exchange group)

This takes a Spake2 object for a particular AbelianGroup, which has our password scalar and protocol parameters, and generates a Spake2Exchange for that group.

We have another function that computes the outbound message:

-- | Determine the element (either \(X^{\star}\) or \(Y^{\star}\)) to send to the other side.
computeOutboundMessage
  :: AbelianGroup group
  => Spake2Exchange group
  -> Element group

This takes a Spake2Exchange as its input. This means it is _impossible_ for us to call it unless we have already called startSpake2.

We don’t need to write tests for what happens if we try to call it before we call startSpake2; in fact, we cannot write such tests. They won’t compile.

Psychologically, this helped me immensely. It’s one less thing I have to worry about getting right, and that frees me up to explore other things.

It also meant I had to do less work to be satisfied with correctness. This one-line type signature replaces two or three tests.

We can also see that startSpake2 is the only thing that generates random numbers. This means we know that computeOutboundMessage will always return the same element for the same initiated exchange.

Conclusion

Haskell helped me be more confident in the correctness of my code, and also gave me tools to explore the terrain further.

It’s easy to think of static types as a constraint that binds you and prevents you from doing wrong things, but an expressive type system can help you figure out what code to write.

by Jonathan Lange at October 12, 2017 11:00 PM

October 10, 2017

Itamar Turner-Trauring

The lone and level sands of software

There’s that moment late at night when you can’t sleep, and you’re so tired you can’t even muster the energy to check the time. So you stare blindly at the ceiling and look back over your life, and you think: “Did I really accomplish anything? Was my work worth anything at all?”

I live in a 140-year-old house, a house which has outlasted its architect and builders, and quite possibly will outlast me. But having spent the last twenty years of my life building software, I can’t really hope to have my own work live on. In those late night moments I sometimes believe that my resume, like that of most programmers, should open with a quote from Shelley’s mocking poem:

My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.

Who among us has not had projects canceled, rewritten from scratch, obsoleted, abandoned or discarded? Was that code worth writing, or was all that effort just a futile waste?

Decay, boundless and bare

Consider some of the projects I’ve worked on. I’ve been writing software for 20+ years at this point, which means I’ve accumulated many decayed wrecks:

  • The multimedia CD-ROMs I created long ago no longer run on modern operating systems, not so much because of Microsoft but because of my own design mistake.
  • The dot-com I worked for turned out to be a dot-bomb.
  • An offline educational platform turned out, on reflection by the customer, not to require offline capabilities. It was rewritten (by someone else) as a simpler web app.
  • The airline reservation project I was a small part of, a massive and rarely undertaken project, finally went live on a small airline. Google, which had acquired the company that built it, shut the project down a couple of years later. Some parts were used elsewhere and lived on, but I’m told that they have since been rewritten; by now the legacy software has probably been decommissioned.
  • Projects done for startups… gone down with the company, or abandoned by a pivot, or surviving in zombie form as unmaintained open source.

I could go on, but that would just make me sadder. This is not to say none of my software lives on: there are open source projects, mostly, that have survived quite a while, and will hopefully continue for many more years. But I’ve spent years of my life working on software that is dead and gone.

How about you? How much of your work has survived?

Which yet survive

So what do you have left, after all these years of effort? You get paid for your work, of course, and getting paid has its benefits. And if you’re lucky your software proved valuable to someone, for a while at least, before it was replaced or shut down. For me at least that’s worth even more than the money.

But there’s something else you gain, something you get to take with you when the money is spent and your users have moved on: knowledge, skills, and mistakes you’ll know how to avoid next time. Every failure I’ve listed above, every mistake I’ve made, every preventable rewrite, is something I hope to avoid the next time around.

And while software mostly dies quickly, the ideas live on, and if we pay attention it’ll be the good ideas that survive. I’ve borrowed ideas for my own logging library from software that is now dead. If my library dies one day, and no doubt it will, I can only hope its own contributions will be revived by one of my users, or even someone who just half-remembers a better way of doing things.

Dead but not forgotten

Since the ultimate benefit of most software projects is what you learned from them, it’s important to make sure you’re actually learning. It’s easy to just do your work and move on. If you’re not careful you’ll forget to look for the mistakes to avoid next time, and you won’t notice the ideas that are the only thing that can truly survive in the long run.

  • Every month or two, take a look at what you’ve been working on, and ask yourself: “Am I learning something new?” If you aren’t, it’s time for a change: perhaps just a bit of introspection to see what there is to be learned, perhaps a new project, maybe even a new job.
  • If you have learned something, ask yourself if you’ve ensured that this knowledge is passed on to others, so they can gain something from it.

As for me, I’ve been writing a weekly newsletter where I share my mistakes, some mentioned above, others in my current work: you can gain from my failures, without all the wasted effort.

October 10, 2017 04:00 AM

October 04, 2017

Itamar Turner-Trauring

Technical skills alone won't make you productive

When you’re just starting out in your career as a programmer, the variety and number of skills you think you need to learn can be overwhelming. And working with colleagues who produce far more than you do can be intimidating, demoralizing, and confusing: how do they do it? How can some programmers create so much more?

The obvious answer is that these productive programmers have technical skills. They know more programming languages, more design patterns, more architectural styles, more testing techniques. And all these do help: they’ll help you find a bug faster, or implement a solution that is more elegant and efficient.

But the obvious answer is insufficient: technical skills are necessary, but they’re not enough, and they often don’t matter as much as you’d think. Productivity comes from avoiding unnecessary work, and unnecessary work is a temptation you’ll encounter long before you reach the point of writing code.

In this post I’m going to cover some of the ways you can be unproductive, from most to least unproductive. As you’ll see, technical programming skills do help, but only much further along in the process of software development.

How to be unproductive

1. Destructive work

The most wasteful and unproductive thing you can do is work on something that hurts others, or that you personally think is wrong. Instead of creating, you’re actively destroying. Instead of making the world a better place, you’re making the world a worse place. The better you are at your job, the less productive you are.

Being productive, then, starts with avoiding destructive work.

2. Work that doesn’t further your goals

You go to work every day, and you’re bored. You’re not learning anything, you’re not paid well, you don’t care one way or another about the results of your work… why bother at all?

Productivity can only be defined against a goal: you’re trying to produce some end result. If you’re working on something that doesn’t further your own goals—making money, learning, making the world a better place—then this work isn’t productive for you.

To be productive, figure out your own goals, and then find work that will align your goals with those of your employer.

3. Building something no one wants

You’re working for a startup, and it’s exciting, hard work, churning out code like there’s no tomorrow. Finally, the big day comes: you launch your product to great fanfare. And then no one shows up. Six months later the company shuts down, and you’re looking for a new job.

This failure happens at big companies too, and it happens to individuals building their side project: building a product that the world doesn’t need. It doesn’t matter how good a programmer you are: if you’re working on solving a problem that no one has, you’re doing unnecessary work.

Personally, I’ve learned a lot from Stacking the Bricks about how to avoid this form of unproductive work.

4. Running out of time

Even if you’re working on a real problem, on a problem you understand well, your work is for naught if you fail to solve the problem before you run out of time or money. Technical skills will help you come up with a faster, simpler solution, but they’re not enough. You also need to avoid digressions, unnecessary work that will slow you down.

The additional skills you need here are project planning skills. For example:

  • Prioritization, figuring out what is most important.
  • Timeboxing, setting timeouts for your work after which you stop and reassess your situation, e.g. by asking for help.
  • Planning, working out the critical path from where you want to be to where you are now.

5. Solving the symptoms of a problem, instead of the root cause

Finally, you’ve gotten to the point of solving a problem! Unfortunately, you haven’t solved the root cause because you haven’t figured out why you’re doing your work. You’ve added a workaround, instead of discovering the bug, or you’ve made a codepath more efficient, when you could have just ripped out an unused feature altogether.

Whenever you’re given a task, ask why you’re doing it, what success means, and keep digging until you’ve found the real problem.

6. Solving a problem inefficiently

You’ve solved the right problem, on time and on budget! Unfortunately, your design wasn’t as clean and efficient as it could have been. Here, finally, technical skills are the most important skills.

Beyond technical skills

If you learn the skills you need to be productive—starting with goals, prioritizing, avoiding digressions, and so on—your technical skills will also benefit. Learning technical skills is just another problem to solve: you need to learn the most important skills, with a limited amount of time. When you’re thinking about which skills to learn next, take some time to consider which skills you’re missing that aren’t a programming language or a web framework.

Here’s one suggestion: during my 20+ years as a programmer I’ve made all but the first of the mistakes I’ve listed above. You can hear these stories, and learn how to avoid my mistakes, by signing up for my weekly Software Clown email.

October 04, 2017 04:00 AM

September 28, 2017

Moshe Zadka

Brute Forcing AES

Thanks to Paul Kehrer for reviewing! Any mistakes or oversights that are left are my responsibility.

AES's maximum key size is 256 bits (128- and 192-bit versions are also available). Is that enough? Well, if there is a cryptographic flaw in AES (i.e., a way to recover some bits of the key by some manipulation that takes fewer than 2**256 operations), then it depends on how big the flaw is. All algorithms come with the probabilistic "flaw" that, on average, only 50% of the keys need to be tested -- since the right key is just as likely to be in the first half as in the second half. This means that, on average, just 2**255 operations are needed to check "all" keys.

If there is an implementation flaw in your AES implementation, then it depends on the flaw -- most implementation flaws are "game over". For example, if the radio leakage from the CPU is enough to detect key bits, the entire key can be recovered -- and that would remain true (with only minor additional hardship) even if the key were 4K bits long. Another example is a related-key attack, where many messages are encrypted with keys that have a certain relationship to each other (e.g., sharing a prefix). An implementation flaw of this kind (in a different encryption algorithm) is what defeated the WEP WiFi standard.

What if there is none? What if actually recovering a key requires checking all possibilities? Can someone do it, if they have a "really big" computer? Or a $10B data-center?

How much is 256-bit security really worth?

Let's see!

We'll be doing a lot of unit conversions, so we bring in the pint library, and create a new unit registry.

import pint
REGISTRY = pint.UnitRegistry()

Assume we have a really fast computer. How fast? As fast as theoretically possible, or so: one operation per "jiffy", which we will take to be 5.4*10**-44 seconds -- the Planck time, the shortest physically meaningful interval. (If someone tells you they'll be back in a jiffy, they're probably lying -- unless they're really fast, and going a very short distance!)

REGISTRY.define('jiffy = 5.4*10**-44 seconds')

Some secrets are temporary. Your birthday surprise party is no longer a secret after your friends yell "surprise!". Some secrets are long-lived. The British kept the secret of the broken Enigma until none were in use -- long after WWII was done.

Even the Long Now Foundation, though, does not have concrete plans post-dating the death of our sun. No worries: unless the sun gets more efficient at burning through its fuel, the cursed orb has a few years left on it.

sun_life = 10**10 * REGISTRY.years

With our super-fast computer, how many ticks do we get until the light of the sun shines no longer...

ticks = sun_life.to('jiffy').magnitude

...and how many do we need to brute-force AES?

brute_force_aes = 2**256

Luckily, brute-force parallelises really well: just have each computer check a different part of the key-space. We have fast computer technology, and quite a while, so how many do we need?

parallel = brute_force_aes / ticks

No worries! Let's just take over the US, and use its entire Federal budget to finance our computers.

US_budget = 4 * 10**12

Assume our technology is cheap -- maintaining each computer, for the entire lifetime of the sun, costs a mere $1.

Do we have enough money?

parallel/US_budget
4953.566155198452

Oh, we are only off by a factor of about 5000. We just need the budget of 5000 more countries, about as wealthy as the US, in order to fund our brute-force project.
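
For convenience, here is the whole estimate assembled into one runnable script (assuming only that the pint package is installed); it is just the snippets above, in order:

import pint

REGISTRY = pint.UnitRegistry()
REGISTRY.define('jiffy = 5.4*10**-44 seconds')

sun_life = 10**10 * REGISTRY.years
ticks = sun_life.to('jiffy').magnitude  # jiffies until the sun dies

brute_force_aes = 2**256            # keys to check
parallel = brute_force_aes / ticks  # computers needed to finish in time

US_budget = 4 * 10**12              # dollars, at $1 per computer
print(parallel / US_budget)         # ~4953: about 5000 US federal budgets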

Again, to be clear, none of this is a cryptographic analysis of AES -- but AES is the target of much analysis, and thus far, no theoretical flaw has been found that gives more than a bit or two. Assuming AES is secure, and assuming the implementation has no flaws, brute-forcing AES is impossible -- even with alien technology, plenty of time and access to quite a bit of the world's wealth.

by Moshe Zadka at September 28, 2017 05:50 AM

September 24, 2017

Jp Calderone

Finishing the txflashair Dockerfile

Some while ago I got a new wifi-capable camera. Of course, it has some awful proprietary system for actually transferring images to a real computer. Fortunately, it's all based on a needlessly complex HTTP interface which can fairly easily be driven by any moderately capable HTTP client. I played around with FlashAero a bit first but it doesn't do quite what I want out of the box and the code is a country mile from anything I'd like to hack on. It did serve as a decent resource for the HTTP interface to go alongside the official reference which I didn't find until later.

Fast forward a bit and I've got txflashair doing basically what I want - essentially, synchronizing the contents of the camera to a local directory. Great. Now I just need to deploy this such that it will run all the time and I can stop thinking about this mundane task forever. Time to bust out Docker, right? It is 2017 after all.

This afternoon I took the Dockerfile I'd managed to cobble together in the last hack session:


FROM python:2-alpine

COPY . /src
RUN apk add --no-cache python-dev
RUN apk add --no-cache openssl-dev
RUN apk add --no-cache libffi-dev
RUN apk add --no-cache build-base

RUN pip install /src

VOLUME /data

ENTRYPOINT ["txflashair-sync"]
CMD ["--device-root", "/DCIM", "--local-root", "/data", "--include", "IMG_*.JPG"]

and turned it into something halfway decent that actually produces a working image, to boot:

FROM python:2-alpine

RUN apk add --no-cache python-dev
RUN apk add --no-cache openssl-dev
RUN apk add --no-cache libffi-dev
RUN apk add --no-cache build-base
RUN apk add --no-cache py-virtualenv
RUN apk add --no-cache linux-headers

RUN virtualenv /app/env

COPY requirements.txt /src/requirements.txt
RUN /app/env/bin/pip install -r /src/requirements.txt

COPY . /src

RUN /app/env/bin/pip install /src

FROM python:2-alpine

RUN apk add --no-cache py-virtualenv

COPY --from=0 /app/env /app/env

VOLUME /data

ENTRYPOINT ["/app/env/bin/txflashair-sync"]
CMD ["--device-root", "/DCIM", "--local-root", "/data", "--include", "IMG_*.JPG"]

So, what have I done exactly? The change to make the thing work is basically just installing the missing dependencies: py-virtualenv, and linux-headers, which took a few minutes to track down. netifaces needs linux-headers as a build dependency; I couldn't find an apk equivalent to apt-get build-dep, but I did finally track down the netifaces APKBUILD file and found that linux-headers was probably what I was missing. Et voilà, it was. Perhaps more interesting, though, are the changes to reduce the image size. I began using the new-ish Docker feature of multi-stage builds. Everything from the beginning of the file down to the 2nd FROM line defines a Docker image as usual. However, the second FROM line starts a new image, which is allowed to copy some of the contents of the first image. I merely copy the entire virtualenv that was created in the first image into the second one, leaving all of the overhead of the build environment behind to be discarded.

The result is an image that only has about 50MiB of deltas (compressed, I guess; the Docker CLI presentation of image/layer sizes seems ambiguous and/or version dependent) from the stock Alpine Python 2 image. That's still pretty big for what's going on, but it's not crazy big - like including all of gcc, etc.

The other changes involving virtualenv are in support of using the multi-stage build feature. Putting the software in a virtualenv is not a bad idea in general but in this case it also provides a directory containing all of the necessary txflashair bits that can easily be copied to the new image. Note that py-virtualenv is also copied to the second image because a virtualenv does not work without virtualenv itself being installed, strangely.

Like this kind of thing? Check out Supporting Open Source on the right.

by Jean-Paul Calderone (noreply@blogger.com) at September 24, 2017 07:15 PM

September 23, 2017

Glyph Lefkowitz

Photo Flow

Hello, the Internet. If you don’t mind, I’d like to ask you a question about photographs.

My spouse and I both take pictures. We both anticipate taking more pictures in the near future. No reason, just a total coincidence.

We both have iPhones, and we both have medium-nice cameras that are still nicer than iPhones. We would both like to curate and touch up these photos and actually do something with them; ideally we would do this curation collaboratively, whenever either of us has time.

This means that there are three things we want to preserve:

  1. The raw, untouched photographs, in their original resolution,
  2. The edits that have been made to them, and
  3. The “workflow” categorization that has been done to them (minimally, “this photo has not been looked at”, “this photo has been looked at and it’s not good enough to bother sharing”, “this photo has been looked at and it’s good enough to be shared if it’s touched up”, and “this has been/should be shared in its current state”). Generally speaking this is a “which album is it in” categorization.

I like Photos. I have a huge photo library with years of various annotations in it, including faces (the only tool I know of that lets you do offline facial recognition so you can automatically identify pictures of your relatives without giving the police state a database to do the same thing).

However, iCloud Photo Sharing has a pretty major issue: it downscales photographs to “up to 2048 pixels on the long edge”, which is far smaller than even the 12-megapixel (roughly 4032×3024) photos that the iPhone 7 produces; more importantly, it’s lower resolution than our television, so the image degradation is visible. This is fine for sharing a pic or two on a small phone screen, but not good for a long-term archival solution.

To complicate matters, we also already have an enormous pile of disks in a home server that I have put way too much energy into making robust; a similarly-sized volume of storage would cost about $1300 a year with iCloud (and would not fit onto one account, anyway). I’m not totally averse to paying for some service if it’s turnkey, but something that uses our existing pile of storage would definitely get bonus points.

Right now, my plan is to dump all of our photos into a shared photo library on a network drive, only ever edit them at home, try to communicate carefully about when one or the other of us is editing it so we don’t run into weird filesystem concurrency issues, and hope for the best. This does not strike me as a maximally robust solution. Among other problems, it means the library isn’t accessible to our mobile devices. But I can’t think of anything better.

Can you? Email me. If I get a really great answer I’ll post it in a followup.

by Glyph at September 23, 2017 01:27 AM

September 19, 2017

Moshe Zadka

Announcing NColony 17.9.0

I have released NColony 17.9.0, available in a PyPI near you.

New this version:

  • CalVer
  • Python 3 support!
  • You can explicitly ask to inherit environment variables from the monitoring process.
  • Website

Thanks to Mark Williams for reviewing many pull requests.

by Moshe Zadka at September 19, 2017 10:00 PM

September 18, 2017

Itamar Turner-Trauring

Join our startup, we'll cut your pay by 40%!

Have you ever thought to yourself, “I need to get paid far far less than I’m worth?” Me neither. And yet some companies not only pay less, they’re proud of it. Allow me to explain—

I recently encountered a job posting from one such startup. My usual response would be to roll my eyeballs and move on, but this particular posting was so egregious that had I done so I would’ve ended up looking at the back of my skull.

So in an effort to avoid the pain of over-rolled eyeballs, and more importantly to help you avoid the pain of working for this kind of company, let me share the key sentence from the job posting:

“It’s not unusual to see some team members in the office late into the evening; many of us routinely work and study 70+ hours a week.”

In this post I will work through the implications of that sentence. I made sure not to drink anything while writing it, because if I had I’d be spitting my drink out every time I reread that sentence. The short version is that should you join such a company, you’d be working for people who are:

  • Exploiting you by massively underpaying you.
  • Destroying your productivity.
  • Awful at project management.

Cutting your salary by 40%

Let’s start with your salary. The standard workweek in the US is 40 hours a week. If you’re going to be working 70 hours a week that means you’re working 75% more hours than usual. Or, to put it another way, the company is offering to pay you 40% less than market rate for your time.
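
Spelled out as arithmetic, as a quick sanity check of those numbers:

hours = 70.0
standard = 40.0

extra_hours = hours / standard - 1  # 0.75: 75% more hours for the same pay
pay_cut = 1 - standard / hours      # ~0.43: roughly the 40% cut per hour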

Instead of hiring more engineers, they’re trying to get their engineers to do far more for the same amount of money. This is exploitation, and there’s no reason you should put up with it.

It’s not that hard to find companies where you can work a normal 40 hour workweek. I’ve done so at the past five companies I’ve worked at, ranging from tiny startups to Google. Sometimes you need to push back, it’s true, but it’s certainly possible. And even if you can’t find such a job, there are many more companies where you can work 45 hours, or 50 hours. Even an awful workweek of 60 hours is better than 70.

When programming is your hobby

Now, it may be that you love programming so much that you’re thinking, “I’d be coding 70 hours a week anyway, why not do it at work?” As I’ll mention below, I don’t think working 70 hours a week is going to produce much, but even if it did you still shouldn’t do it on your employer’s behalf.

Let’s imagine you’re coding 70 hours a week. You could work 70 hours for your employer, getting paid nothing extra for your time, or you could stick to 40 hours and use those remaining 30 hours to:

  • Work on a personal project, just for fun: you could learn new skills that you choose, or build something frivolous because you enjoy it.
  • Work on an open source project, helping others as well.
  • Take on consulting work, getting paid more.
  • Start your own startup, so you get a more significant upside from success.

And you’d also have some optional slack time, which is useful when life gets in the way of programming.

“Work not just smart, but also hard”

Encouraging 70 hour workweeks is an extraordinary level of exploitation, but sadly it’s also a rather common form of stupidity. The problem is encapsulated in another statement from the job posting:

“[We] work not just smart, but also hard.”

If your starting point is exploitation, if you’re setting out to extract as much work as possible from your employees, you lose sight of the purpose of work. Work has no inherent value: what matters is the results. The problems solved, the value created, this is what you’re trying to maximize.

And it turns out there are decades of research showing that consistently working more than 40 hours a week results in less output. But presumably the people running this startup don’t believe that, or they wouldn’t be pushing for it. And maybe you don’t believe it either. But even if we assume 70 hours of work produce 75% more output than 40 hours of work, it’s still a fundamentally bad idea for the company.

When an organization tries to maximize inputs, rather than outputs, the result is a whole series of bad judgments. Hiring, for example, as you can see from this job ad. A junior programmer working 70 hours a week will produce far less valuable output than an experienced programmer working 40 hours a week. But a company that wants to maximize exploitation, to maximize work, will write job ads that ensure the latter will never apply.

Emergencies: when long hours are necessary

Beyond reduced output, and beyond a confused hiring policy, encouraging long hours also implies a lack of project management skills. Long work hours are both a cause and a symptom of this particular failure.

70 hours a week means 7 days a week, from 9AM to 7PM. That doesn’t leave much slack time for life, and it also leaves no slack time for the project. Sooner or later every project has an emergency. If a production server crashes, someone is going to have to bring it back up. And more broadly, extra work comes up: a customer asks for more features, or a seemingly simple task turns out to be far more difficult than expected.

To help deal with these situations you need some advance planning. Scheduling everything down to the minute won’t help, and pushing everyone to work at the absolute limit won’t help. The problem is unexpected work, after all. What you need is planned slack time, time that hasn’t been budgeted, that’s available for all the inevitable unexpected problems.

But a manager who is pushing you to work 70 hours a week isn’t a manager who plans ahead for unexpected work. No, this is a manager who solves problems by telling you to work harder and longer. So when the unexpected happens, when an emergency happens, your manager will be saying “who coulda knowed? ¯\_(ツ)_/¯” and before you know it you’re working 80 hours a week.

Maybe that will fix things. But I doubt it. More plausibly you’ll eventually burn out and quit, taking your business knowledge with you.

“Strong willingness to help junior engineers”

The job posting that led to this post also suggested that a “strong willingness to help junior engineers” would be helpful, though not required. So here’s my advice to all you junior engineers out there: avoid companies that want you to work crazy hours.

  1. It’s bad for you.
  2. It’s bad for the company.
  3. And you don’t want to work for a manager who isn’t competent enough to realize what’s bad for the company.

And if you are stuck working for such a company, you might want to read my book, The Programmer’s Guide to a Sane Workweek.

September 18, 2017 04:00 AM

September 15, 2017

Jp Calderone

SSH to EC2 (Refrain)

Recently Moshe wrote up a demonstration of the simple steps needed to retrieve an SSH public key from an EC2 instance to populate a known_hosts file. Moshe's example uses the highly capable boto3 library for its EC2 interactions. However, since his blog is syndicated on Planet Twisted, reading it left me compelled to present an implementation based on txAWS instead.

First, as in Moshe's example, we need argv and expanduser so that we can determine which instance the user is interested in (accepted as a command line argument to the tool) and find the user's known_hosts file (conventionally located in ~):


from sys import argv
from os.path import expanduser

Next, we'll get an abstraction for working with filesystem paths. This is commonly used in Twisted APIs because it saves us from the many path-manipulation mistakes committed when paths are represented as simple strings:

from filepath import FilePath

Now, get a couple of abstractions for working with SSH. Twisted Conch is Twisted's SSH library (client & server). KnownHostsFile knows how to read and write the known_hosts file format. We'll use it to update the file with the new key. Key knows how to read and write SSH-format keys. We'll use it to interpret the bytes we find in the EC2 console output and serialize them to be written to the known_hosts file.

from twisted.conch.client.knownhosts import KnownHostsFile
from twisted.conch.ssh.keys import Key

And speaking of the EC2 console output, we'll use txAWS to retrieve it. AWSServiceRegion is the main entrypoint into the txAWS API. From it, we can get an EC2 client object to use to retrieve the console output.

from txaws.service import AWSServiceRegion

And last among the imports, we'll write the example with inlineCallbacks to minimize the quantity of explicit callback-management code. Due to the simplicity of the example and the lack of any need to write tests for it, I won't worry about the potential problems with confusing tracebacks or hard-to-test code this might produce. We'll also use react to drive the whole thing so we don't need to explicitly import, start, or stop the reactor.

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react

With that sizable preamble out of the way, the example can begin in earnest. First, define the main function using inlineCallbacks and accepting the reactor (to be passed by react) and the EC2 instance identifier (taken from the command line later on):

@inlineCallbacks
def main(reactor, instance_id):

Now, get the EC2 client. This usage of the txAWS API will find AWS credentials in the usual way (looking at AWS_PROFILE and in ~/.aws for us):

    region = AWSServiceRegion()
    ec2 = region.get_ec2_client()

Then it's a simple matter to get an object representing the desired instance and that instance's console output. Notice these APIs return Deferred, so we use yield to let inlineCallbacks suspend this function until the results are available.

    [instance] = yield ec2.describe_instances(instance_id)
    output = yield ec2.get_console_output(instance_id)

Some simple parsing logic, much like the code in Moshe's implementation (since this is exactly the same text now being operated on). We do take the extra step of deserializing the key into an object that we can use later with a KnownHostsFile object.

    keys = (
        Key.fromString(key)
        for key in extract_ssh_key(output.output)
    )

Then write the extracted keys to the known hosts file:

    known_hosts = KnownHostsFile.fromPath(
        FilePath(expanduser("~/.ssh/known_hosts")),
    )
    for key in keys:
        for name in [instance.dns_name, instance.ip_address]:
            known_hosts.addHostKey(name, key)
    known_hosts.save()

There's also the small matter of actually parsing the console output for the keys:

def extract_ssh_key(output):
    return (
        line for line in output.splitlines()
        if line.startswith(u"ssh-rsa ")
    )

And then kicking off the whole process:

react(main, argv[1:])

Putting it all together:

from sys import argv
from os.path import expanduser

from filepath import FilePath

from twisted.conch.client.knownhosts import KnownHostsFile
from twisted.conch.ssh.keys import Key

from txaws.service import AWSServiceRegion

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react


@inlineCallbacks
def main(reactor, instance_id):
    region = AWSServiceRegion()
    ec2 = region.get_ec2_client()

    [instance] = yield ec2.describe_instances(instance_id)
    output = yield ec2.get_console_output(instance_id)

    keys = (
        Key.fromString(key)
        for key in extract_ssh_key(output.output)
    )

    known_hosts = KnownHostsFile.fromPath(
        FilePath(expanduser("~/.ssh/known_hosts")),
    )
    for key in keys:
        for name in [instance.dns_name, instance.ip_address]:
            known_hosts.addHostKey(name, key)
    known_hosts.save()


def extract_ssh_key(output):
    return (
        line for line in output.splitlines()
        if line.startswith(u"ssh-rsa ")
    )


react(main, argv[1:])

So, there you have it. This is roughly equivalent in complexity to using boto3, and on its own there's little reason to prefer it to what Moshe has written about. However, if you have a larger Twisted-based application then you may prefer the natively asynchronous txAWS to blocking boto3 calls or to managing boto3 in a thread somehow.
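
For comparison, a blocking boto3 version might look roughly like the sketch below. To be clear, this is my guess at the shape of such a script, not Moshe's actual code; it assumes default AWS credentials and an instance whose console output includes its ssh-rsa host key lines:

from sys import argv
from os.path import expanduser

import boto3

def extract_ssh_keys(output):
    # Same parsing idea as extract_ssh_key above.
    return [
        line for line in output.splitlines()
        if line.startswith(u"ssh-rsa ")
    ]

def main(instance_id):
    ec2 = boto3.client("ec2")
    described = ec2.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    console = ec2.get_console_output(InstanceId=instance_id)
    with open(expanduser("~/.ssh/known_hosts"), "a") as known_hosts:
        for key in extract_ssh_keys(console.get("Output", "")):
            for name in [instance["PublicDnsName"], instance["PublicIpAddress"]]:
                # known_hosts lines are simply "hostname key-type base64-key".
                known_hosts.write("{} {}\n".format(name, key))

main(argv[1])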

Also, I'd like to thank LeastAuthority (my current employer and operator of the Tahoe-LAFS-based S4 service which just so happens to lean heavily on txAWS) for originally implementing get_console_output for txAWS (which, minor caveat, will not be available until the next release of txAWS is out).

As always, if you like this sort of thing, check out the support links on the right.

by Jean-Paul Calderone (noreply@blogger.com) at September 15, 2017 02:34 PM

September 09, 2017

Itamar Turner-Trauring

The better way to learn a new programming language

Have you ever failed to learn a new programming language in your spare time? You pick a small project to implement, get a few functions written… and then you run out of time and motivation. So you give up, at least until the next time you give it a try.

There’s a better way to learn new programming languages, a method that I’ve applied multiple times. Where starting a side project often ends in failure and little knowledge gained, this method starts with success. For example, the last time I did this was with Ruby: I started by publishing a whole new Ruby Gem, and getting a bug fix accepted into the Sinatra framework.

In this post I will:

  • Explain why a side project is a difficult way to learn a new language.
  • Share my story of learning a tiny bit of Ruby.
  • Explain my preferred learning method in detail.

Side projects: the hard way to learn a language

Creating a new software project from scratch in your spare time is a tempting way to learn a new language. You get to build something new: building stuff is fun. You get to pick your language: you have the freedom to choose.

Unfortunately, learning via a side project is a difficult way to learn a new language. When you’re learning a new programming language you need to learn:

  • The build and run toolchain.
  • The packaging toolchain.
  • The testing toolchain.
  • How to use 3rd party packages.
  • The syntax.
  • The semantics: memory management, units of abstraction, concurrency, execution model, and so forth.
  • Standard idioms.
  • The standard library.
  • Common 3rd party libraries relevant to your problem domain: which to use, and their APIs.

This is a huge amount of knowledge, and you’re doing so with multiple handicaps:

Learning on your own: You have to figure out everything on your own.

Blank slate: You’re starting from scratch. Quite often there’s no scaffolding to help you, no good starting point to get you going.

Simultaneous learning: You are trying to learn everything in the list above at the same time.

Limited time: You’re doing this in your spare time, so you may have only limited amounts of spare time to apply to the task.

Lack of motivation: If you care about the side project’s success, you probably will be motivated to switch back to a language you know. If you just care about learning the language, you’ll be less motivated to do all the boring work to make the side project succeed.

Vague goals: “Learning a language” is an open-ended task, since there’s always more to learn. How will you know you’ve achieved something?

Personally I have very limited free time: I can’t start a new side project in a language I already know, let alone a new one. But I do occasionally learn a new language.

That time I learned some Ruby

Rather than learning new languages at home, I use a better method: learning a language by solving problems at my job.

For example, I know very little Ruby, and when I started learning it I knew even less. One day, however, I joined a company that was publishing an SDK in multiple languages, one of which was Ruby.

A tiny gem

My first task involving Ruby was integrating the SDK with popular Ruby HTTP clients and servers. Which is to say, I started learning a new language with a specific goal, motivation, and time to learn at work. Much better than a personal side project!

I started by learning one thing, not multiple things simultaneously: which 3rd party HTTP libraries were popular. Once I’d found the popular HTTP clients and servers, my next task was implementing the SDK integration. One integration was with Sinatra, a popular Ruby HTTP server framework.

As a coding task this was pretty simple:

  1. The Sinatra docs pointed me towards a library called Rack, a standard way to write HTTP server middleware for Ruby.
  2. Rack has documentation and tutorials on how to create middleware.
  3. There are lots of pre-existing middleware packages I could use as examples, for both the code itself and for tests.
  4. I only needed to learn just enough Ruby syntax and semantics to write the middleware. Googling tutorials was enough for that.

I learned just enough to implement the middleware: 40 lines of trivial code.

Next I needed to package the middleware as a gem, Ruby’s packaging format. Once again, I was only working on a single task, a well-documented task with many examples. And I had motivation, specific goals, examples to build off of, and the time to do it.

At this point I’d learned: a tiny bit of syntax and semantics, some 3rd party libraries, packaging, and a little bit of the toolchain.

A bugfix to an existing project

Shortly after creating our SDK integration I discovered a bug in Sinatra: Sinatra middleware was only initialized after the first request. So I tracked down the bug in Sinatra… which gave me an opportunity to learn more of the language’s syntax, semantics, and idioms by reading a real-world existing code base. And, of course, the all-important skill of knowing how to add debug print statements to the code.

Reading code is a lot easier than writing code. And since Sinatra was a pre-existing code base, I could rely on pre-existing tests as examples when I wrote a test for my patch. I didn’t need to figure out how to structure a large project, or every edge case of the syntax that wasn’t relevant to the bug. I had a specific goal, and I learned just enough to reach it.

At the end of the process above I still couldn’t start a Ruby project from scratch, or write more than tiny amounts of Ruby. And I haven’t done much with it since. But I do know enough to deal with packaging, and if I ever started writing Ruby again I’d start with a lot more knowledge of the toolchain, to the point where I’d be able to focus purely on syntax and semantics.

But I’ve used a similar method to learn other languages to a much greater extent: I learned C++ by joining a company that used it, and I became a pretty fluent C++ programmer for a while.

Learning a new language: a better method

How should you learn a new programming language? As in my story above, the best way to do so is at work, and ideally by joining an existing project.

Existing projects

The easiest way to learn a new language is to join an existing project or company that uses a language you don’t know. None of the problems you would have with a side project apply:

  • There’s lot of existing examples to learn from and modify, you’re not starting with blank slate.
  • You don’t have to learn the build/run, testing, and packaging toolchains before you can do anything useful: it’s mostly going to be setup for you, so you can learn it by osmosis over time.
  • You have specific goals: fix this bug, add this feature.
  • You have co-workers you can ask for help, who can review your code, and can help you write more idiomatically.

New projects

Lacking an existing project to join, look out for opportunities where there’s a strong motivation for your project to add a new language. Some examples:

  • You need to do some data science, so you need to add Python or maybe R.
  • Your project has a problem with a computationally intensive bottleneck, so you need to add something like Rust, C++, or C.

Starting a new project is not quite as easy a learning experience, unfortunately. But you’re still starting with specific goals in mind, and with time at work to learn the language.

Make sure to limit yourself to only learning one thing at a time. In my example above I sequentially learned about: which 3rd party libraries existed, the API for one library, writing minuscule amounts of trivial integration code, packaging, and then how to read a lot more syntax and semantics. If you’re doing this with co-workers you can split up tasks: you do the packaging while your co-worker builds the first prototype, and then you can teach each other what you’ve learned.

Learning at work is the best learning

More broadly, your job is a wonderful place to learn. Every task you do at work involves skills, skills you can practice and improve. You can get better at debugging, or notice a repetitive task and automate it, or learn how to write better bug reports. Perhaps you could figure out what needs changing so you can make changes faster (processes? architecture? APIs?). Maybe you can figure out how to test your code better to reduce the number of bugs you ship. And if that’s not enough, Julia Evans has even more ideas.

In all these cases you’ll have motivation, specific goals, time, and often an existing code base to build off of. And best of all, you’ll be able to learn while you’re getting paid.

September 09, 2017 04:00 AM

August 31, 2017

Moshe Zadka

SSH to EC2

(Thanks to Donald Stufft for reviewing this post, and to Glyph Lefkowitz for inspiring much of it.)

(JP Calderone wrote a Twisted version of this approach.)

It is often the case that after creating an EC2 instance in AWS, the next step is SSHing. This might be because the machine is a development machine, or it might be tilling the ground for a different remote control: for example, setting up a salt minion.

In those cases, many either press y when seeing SSH prompt them about an unknown host key, or even turn off host key verification altogether. This is convenient, quick, and very insecure. A man in the middle can use this to steal credentials -- maybe not permanently, but enough to log in to any other machine with the same SSH key.

The correct thing to do is to prepare the SSH configuration by retrieving the host key via the AWS API. Unfortunately, doing it is not trivial.

Fortunately, it is a good example of how to use the AWS API from Python.

import os
import sys
import boto3

client = boto3.client('ec2', region_name='us-west-2')
resource = boto3.resource('ec2', region_name='us-west-2')

output = client.get_console_output(InstanceId=sys.argv[1])
result = output['Output']

rsa = [line for line in result.splitlines()
            if line.startswith('ssh-rsa')][0]

instance = resource.Instance(sys.argv[1])
known_hosts = '{},{} {}\n'.format(instance.public_dns_name,
                                  instance.public_ip_address,
                                  rsa)

with open(os.path.expanduser('~/.ssh/known_hosts'), 'a') as fp:
    fp.write(known_hosts)

Let's go through this script section by section.

import os
import sys
import boto3

We import the os and sys modules from the standard library, and the first-party AWS module boto3.

client = boto3.client('ec2', region_name='us-west-2')
resource = boto3.resource('ec2', region_name='us-west-2')

It is often confusing what functionality is in client and what is in resource. The only rule I learned in a year of using the AWS API is to look in both places, and create both a client and a resource. In general, client maps directly to AWS low-level REST API, while resource gives higher level abstractions.

output = client.get_console_output(InstanceId=sys.argv[1])
result = output['Output']

This is the meat of the script -- we use the API to get the console output. These are the boot up messages from all services. When the SSH server starts up, it prints its key. All that is left now is to find it.

rsa = [line for line in result.splitlines()
            if line.startswith('ssh-rsa')][0]

This is a little hacky, but there is no nice way to do it. There are other possible heuristics. The nice thing is that if the heuristic fails, this will result in connection failure -- not an insecure connection!

instance = resource.Instance(sys.argv[1])
known_hosts = '{},{} {}\n'.format(instance.public_dns_name,
                                  instance.public_ip_address,
                                  rsa)

We grab the IP and name through the resource, and format them in the right way for SSH to understand.

with open(os.path.expanduser('~/.ssh/known_hosts'), 'a') as fp:
    fp.write(known_hosts)

I chose to update known_hosts like this because originally this script was in a throw-away Docker image. In other cases, it might be wise to have a separate known hosts file for EC2 instances, or have an atomic update methodology.

After running the script, it is possible to SSH in without being prompted about an unknown host key. It is best to set the SSH options to fail if the host key is not there, for extra safety.

An alternative approach is to use the AWS API to set the SSH secret key. However, this is, in general, even less trivial to do securely.

by Moshe Zadka at August 31, 2017 04:30 AM

August 17, 2017

Duncan McGreggor

NASA/EOSDIS Earthdata

Update

It's been a few years since I posted on this blog -- most of the technical content I've been contributing to in the past couple years has been in the following:
But since the publication of the Mastering matplotlib book, I've gotten more and more into satellite data. The book, it goes without saying, focused on Python; the analysis and interpretation of satellite data was one of the many topics covered. After that I spent some time working with satellite and GIS data in general using Erlang and LFE. Ultimately though, I found that more and more projects were using the JVM for this sort of work, and in particular, I noted that Clojure had begun to show up in a surprising number of Github projects.

EOSDIS

Enter NASA's Earth Observing System Data and Information System (see also earthdata.nasa.gov and EOSDIS on Wikipedia), a key part of the agency's Earth Science Data Systems Program. It's essentially a concerted effort to bring together the mind-blowing amounts of earth-related data being collected throughout, around, and above the world so that scientists may easily access and correlate earth science data for their research.

Related NASA projects include the following:
The acronym menagerie can be bewildering, but digging into the various NASA projects is ultimately quite rewarding (greater insights, previously unknown resources, amazing research, etc.).

Clojure

Back to the Clojure reference I made above: I've been contributing to the nasa/Common-Metadata-Repository open source project (hosted on Github) for a few months now, and it's been amazing to see how all this data from so many different sources gets added, indexed, updated, and generally made so much more available to any who want to work with it. The private sector always seems to be so far ahead of large projects in terms of tech and continuously improving updates to existing software, so it's been pretty cool to see a large open source project in the NASA Github org make so many changes that find ways to keep helping their users do better research. Even more so, users are regularly delivered new features in a large, complex collection of libraries and services thanks in part to the benefits that come from using a functional programming language.

It may seem like nothing to you, but the fact that there are now directory pages for various data providers (e.g., GES_DISC, i.e., Goddard Earth Sciences Data and Information Services Center) makes a big difference for users of this data. The data provider pages now also offer easy access to collection links such as UARS Solar Ultraviolet Spectral Irradiance Monitor. Admittedly, the directory pages still take a while to load, but there are improvements on the way for page load times and other related tasks. If you're reading this a month after this post was written, there's a good chance it's already been fixed by now.

Summary

In summary, it's been a fun personal journey from looking at Landsat data for writing a book to working with open source projects that really help scientists to do their jobs better :-) And while I have enjoyed using the other programming languages to explore this problem space, Clojure in particular has been a delightfully powerful tool for delivering new features to the science community.

by Duncan McGreggor (noreply@blogger.com) at August 17, 2017 02:05 PM

August 16, 2017

Itamar Turner-Trauring

The tragic tale of the deadlocking Python queue

This is a story about how very difficult it is to build concurrent programs. It’s also a story about a bug in Python’s Queue class, a class which happens to be the easiest way to make concurrency simple in Python. This is not a happy story: this is a tragedy, a story of deadlocks and despair.

This story will take you on a veritable roller coaster of emotion and elucidation, as you:

  • Shiver at the horror that is concurrent programming!
  • Bask in the simplicity of using Queue!
  • Frown at a mysteriously freezing program!
  • Marvel as you discover how to debug deadlocks with gdb!
  • Groan as reentrancy rears its ugly head!
  • Gasp as you realize that this bug is not theoretical!
  • Weep when you read the response of Python’s maintainers!

Join me, then, as I share this tale of woe.

Concurrency is hard

Writing programs with concurrency, programs with multiple threads, is hard. Without threads code is linear: line 2 is executed after line 1, with nothing happening in between. Add in threads, and now changes can happen behind your back.

Race conditions

The following counter, for example, will become corrupted if increment() is called from multiple threads:

from threading import Thread

class Counter(object):
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1

c = Counter()

def go():
    for i in range(1000000):
        c.increment()

# Run two threads that increment the counter:
t1 = Thread(target=go)
t1.start()
t2 = Thread(target=go)
t2.start()
t1.join()
t2.join()
print(c.value)

Run the program, and:

$ python3 racecondition.py
1686797

We incremented 2,000,000 times, but that’s not what we got. The problem is that self.value += 1 actually takes three distinct steps:

  1. Getting the attribute,
  2. incrementing it,
  3. then setting the attribute.

If two threads call increment() on the same object around the same time, the following series of steps may happen:

  1. Thread 1: Get self.value, which happens to be 17.
  2. Thread 2: Get self.value, which happens to be 17.
  3. Thread 1: Increment 17 to 18.
  4. Thread 1: Set self.value to 18.
  5. Thread 2: Increment 17 to 18.
  6. Thread 2: Set self.value to 18.

An increment was lost due to a race condition.

One way to solve this is with locks:

from threading import Lock

class Counter(object):
    def __init__(self):
        self.value = 0
        self.lock = Lock()
    def increment(self):
        with self.lock:
            self.value += 1

Only one thread at a time can hold the lock, so only one increment happens at a time.

Deadlocks

Locks introduce their own set of problems. For example, you start having potential issues with deadlocks. Imagine you have two locks, L1 and L2, and one thread tries to acquire L1 followed by L2, whereas another thread tries to acquire L2 followed by L1.

  1. Thread 1: Acquire and hold L1.
  2. Thread 2: Acquire and hold L2.
  3. Thread 1: Try to acquire L2, but it’s in use, so wait.
  4. Thread 2: Try to acquire L1, but it’s in use, so wait.

The threads are now deadlocked: no execution will proceed.

Queues make concurrency simpler

One way to make concurrency simpler is by using queues, and trying to have no other shared data structures. If threads can only send messages to other threads using queues, and threads never mutate data structures shared with other threads, the result is code that is much closer to single-threaded code. Each function just runs one line at a time, and you don’t need to worry about some other thread interrupting you.

For example, we can have a single thread whose job it is to manage a collection of counters:

from collections import defaultdict
from threading import Thread
from queue import Queue

class Counter(object):
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1


counter_queue = Queue()


def counters_thread():
    counters = defaultdict(Counter)
    while True:
        # Get next command out of the queue:
        command, name = counter_queue.get()
        if command == "increment":
            counters[name].increment()

# Start a new thread:
Thread(target=counters_thread).start()

Now other threads can safely increment a named counter by doing:

counter_queue.put(("increment", "shared_counter_1"))

A buggy program

Unfortunately, queues have some broken edge cases. Consider the following program, a program which involves no threads at all:

from queue import Queue

q = Queue()


class Circular(object):
    def __init__(self):
        self.circular = self

    def __del__(self):
        print("Adding to queue in GC")
        q.put(1)


for i in range(1000000000):
    print("iteration", i)
    # Create an object that will be garbage collected
    # asynchronously, and therefore have its __del__
    # method called later:
    Circular()
    print("Adding to queue regularly")
    q.put(2)

What I’m doing here is a little trickery involving a circular reference, in order to add an item to the queue during garbage collection.

By default CPython (the default Python VM) uses reference counting to garbage collect objects. When a new reference to an object is created the count is incremented; when a reference is removed the count is decremented. When the reference count hits zero the object is removed from memory and __del__ is called on it.

However, an object with a reference to itself—like the Circular class above—will always have a reference count of at least 1. So Python also runs a garbage collection pass every once in a while that catches these objects. By using a circular reference we are causing Circular.__del__ to be called asynchronously (eventually), rather than immediately.

Let’s run the program:

$ python3 bug.py 
iteration 0
Adding to queue regularly
Adding to queue in GC

That’s it: the program continues to run, but prints out nothing more. There are no further iterations, no progress.

What’s going on?

Debugging a deadlock with gdb

Modern versions of the gdb debugger have some neat Python-specific features, including the ability to print out a Python traceback. Setup is a little annoying, see here and here and maybe do a bit of googling, but once you have it set up it’s extremely useful.

Let’s see what gdb tells us about this process. First we attach to the running process, and then use the bt command to see the C backtrace:

$ ps x | grep bug.py
28920 pts/4    S+     0:00 python3 bug.py
$ gdb --pid 28920
...
(gdb) bt
#0  0x00007f756c6d0946 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x6464e96e00) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x6464e96e00, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f756c6d09f4 in __new_sem_wait_slow (sem=0x6464e96e00, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f756c6d0a9a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4  0x00007f756ca7cbd5 in PyThread_acquire_lock_timed () at /usr/src/debug/Python-3.5.3/Python/thread_pthread.h:352
...

Looks like the process is waiting for a lock. I wonder why?

Next, we take a look at the Python backtrace:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method __enter__ of _thread.lock object at remote 0x7f756cef36e8>
  File "/usr/lib64/python3.5/threading.py", line 238, in __enter__
    return self._lock.__enter__()
  File "/usr/lib64/python3.5/queue.py", line 126, in put
    with self.not_full:
  File "bug.py", line 12, in __del__
    q.put(1)
  Garbage-collecting
  File "/usr/lib64/python3.5/threading.py", line 345, in notify
    waiters_to_notify = _deque(_islice(all_waiters, n))
  File "/usr/lib64/python3.5/queue.py", line 145, in put
    self.not_empty.notify()
  File "bug.py", line 21, in <module>
    q.put(2)

Do you see what’s going on?

Reentrancy!

Remember when I said that, lacking concurrency, code just runs one line at a time? That was a lie.

Garbage collection can interrupt Python functions at any point, and run arbitrary other Python code: __del__ methods and weakref callbacks. So can signal handlers, which happen e.g. when you hit Ctrl-C (your process gets the SIGINT signal) or a subprocess dies (your process gets the SIGCHLD signal).

In this case:

  1. The program was calling q.put(2).
  2. This involves acquiring a lock.
  3. Half-way through the function call, garbage collection happens.
  4. Garbage collection calls Circular.__del__.
  5. Circular.__del__ calls q.put(1).
  6. q.put(1) tries to acquire the lock… but the lock is already held, so it waits.

Now q.put(2) is stuck waiting for garbage collection to finish, and garbage collection can’t finish until q.put(2) releases the lock.

The program is deadlocked.

Why this is a real bug…

The above scenario may seem a little far-fetched, but it has been encountered by multiple people in the real world. A common cause is logging.

If you’re writing logs to disk you have to worry about the disk write blocking, i.e. taking a long time. This is particularly the case when log writes are followed by syncing-to-disk, which is often done to ensure logs aren’t lost in a crash.

A common pattern is to create log messages in your application thread or threads, and do the actual writing to disk in a different thread. The easiest way to communicate the messages is, of course, a queue.Queue.

This use case is in fact directly supported by the Python standard library:

from queue import Queue
import logging
from logging.handlers import QueueListener, QueueHandler

# Write out queued logs to a file:
_log_queue = Queue()
QueueListener(
    _log_queue, logging.FileHandler("out.log")).start()

# Push all logs into the queue:
logging.getLogger().addHandler(QueueHandler(_log_queue))

Given this common setup, all you need to do to trigger the bug is to log a message in __del__, a weakref callback, or a signal handler. This happens in real code. For example, if you don’t explicitly close a file, Python will warn you about it inside file.__del__, and Python also has a standard API for routing warnings to the logging system.

It’s not just logging, though: the bug was also encountered, among others, by the SQLAlchemy ORM.

…and why Python maintainers haven’t fixed it

(Update: after I wrote this blog post the Python dev team reopened the bug; hopefully it’ll be fixed in Python 3.7.)

This bug was originally reported in 2012, and in 2016 it was closed as “wont fix” because it’s a “difficult problem”.

I feel this is a cop-out. If you’re using an extremely common logging pattern, where writes happen in a different thread, a logging pattern explicitly supported by the Python standard library… your program might deadlock. In particular, it will deadlock if any of the libraries you’re using writes a log message in __del__.

This can happen just by using standard Python APIs like files and warning→logging routing. This happened to one of the users of my Crochet library, due to some logging in __del__ by the Twisted framework. I had to implement my own queuing system to ensure users weren’t impacted by this problem. If I can fix the problem, so can the Python maintainers. For example, Queue.get and Queue.put could be atomic operations (which can be done in CPython by rewriting them in C).

Now, you could argue that __del__ shouldn’t do anything: it should schedule stuff that is run outside it. But scheduling from reentrant code is tricky, and in fact not that different from mutating a shared data structure from multiple threads. If only there was a queue of some sort that we could call from __del__… but there isn’t, because of this bug.

Some takeaways

  1. Concurrency is hard to deal with, but queue.Queue helps.
  2. Reentrancy is hard to deal with, and Python helps you a lot less.
  3. If you’re using queue.Queue on Python, beware of interacting with the queue in __del__, weakref callbacks, or signal handlers.

And by the way, if you enjoyed reading this and would like to hear about all the many ways I’ve screwed up my own software, sign up for my Software Clown newsletter. Every week I share one of my mistakes and how you can avoid it.

Update: Thanks to Maciej Fijalkowski for suggesting actually demonstrating the race condition, and pointing out that __del__ probably really shouldn’t do anything. Thanks to Ann Yanich for pointing out a typo in the code.

August 16, 2017 04:00 AM

August 10, 2017

Duncan McGreggor

Mastering matplotlib: Acknowledgments

The Book

Well, after nine months of hard work, the book is finally out! It's available both on Packt's site and Amazon.com. Getting up early every morning to write takes a lot of discipline; it takes even more to say "no" to enticing rabbit holes or herds of yaks with luxurious coats ripe for shaving ... (truth be told, I still did a bit of that).

The team I worked with at Packt was just amazing. Highly professional and deeply supportive, they were a complete pleasure with which to collaborate. It was the best experience I could have hoped for. Thanks, guys!

The technical reviewers for the book were just fantastic. I've stated elsewhere that my one regret was that the process with the reviewers did not have a tighter feedback loop. I would have really enjoyed collaborating with them from the beginning so that some of their really good ideas could have been integrated into the book. Regardless, their feedback as I got it later in the process helped make this book more approachable by readers, more consistent, and more accurate. The reviewers have bios at the beginning of the book -- read them, and look them up! These folks are all amazing!

The one thing that slipped in the final crunch was the acknowledgements, and I hope to make up for that here, as well as through various emails to everyone who provided their support, either directly or indirectly.

Acknowledgments

The first two folks I reached out to when starting the book were both physics professors who had published very nice matplotlib problems -- one set for undergraduate students and another from work at the National Radio Astronomy Observatory. I asked for their permission to adapt these problems to the API chapter, and they graciously granted it. What followed were some very nice conversations about matplotlib, programming, physics, education, and publishing. Thanks to Professor Alan DeWeerd, University of Redlands and Professor Jonathan W. Keohane, Hampden Sydney College. Note that Dr. Keohane has a book coming out in the fall from Yale University Press entitled Classical Electrodynamics -- it will contain examples in matplotlib.

Other examples adapted for use in the API chapter included one by Professor David Bailey, University of Toronto. Though his example didn't make it into the book, it gets full coverage in the Chapter 3 IPython notebook.

For one of the EM examples I needed to derive a particular equation for an electromagnetic field in two wires traveling in opposite directions. It's been nearly 20 years since my post-Army college physics, so I was very grateful for the existence and excellence of SymPy which enabled me to check my work with its symbolic computations. A special thanks to the SymPy creators and maintainers.

Please note that if there are errors in the equations, they are my fault! Not that of the esteemed professors or of SymPy :-)

Many of the examples throughout the book were derived from work done by the matplotlib and Seaborn contributors. The work they have done on the documentation in the past 10 years has been amazing -- the community is truly lucky to have such resources at their fingertips.

In particular, Benjamin Root is an astounding community supporter on the matplotlib mail list, helping users of every level with all of their needs. Benjamin and I had several very nice email exchanges during the writing of this book, and he provided some excellent pointers, as he was finishing his own title for Packt: Interactive Applications Using Matplotlib. It was geophysicist and matplotlib savant Joe Kington who originally put us in touch, and I'd like to thank Joe -- on everyone's behalf -- for his amazing answers to matplotlib and related questions on StackOverflow. Joe inspired many changes and adjustments in the sample code for this book. In fact, I had originally intended to feature his work in the chapter on advanced customization (but ran out of space), since Joe has one of the best examples out there for matplotlib transforms. If you don't believe me, check out his work on stereonets. There are many of us who hope that Joe will be authoring his own matplotlib book in the future ...

Olga Botvinnik, a contributor to Seaborn and PhD candidate at UC San Diego (and BioEng/Math double major at MIT), provided fantastic support for my Seaborn questions. Her knowledge, skills, and spirit of open source will help build the community around Seaborn in the years to come. Thanks, Olga!

While on the topic of matplotlib contributors, I'd like to give a special thanks to John Hunter for his inspiration, hard work, and passionate contributions which made matplotlib a reality. My deepest condolences to his family and friends for their tremendous loss.

Quite possibly the tool that had the single-greatest impact on the authoring of this book was IPython and its notebook feature. This brought back all the best memories from using Mathematica in school. Combined with the Python programming language, I can't imagine a better platform for collaborating on math-related problems or producing teaching materials for the same. These compliments are not limited to the user experience, either: the new architecture using ZeroMQ is a work of art. Nicely done, IPython community! The IPython notebook index for the book is available in the book's Github org here.

In Chapters 7 and 8 I encountered a bit of a crisis when trying to work with Python 3 in cloud environments. What was almost a disaster ended up being rescued by the work that Barry Warsaw and the rest of the Ubuntu team did in Ubuntu 15.04, getting Python 3.4.2 into the release and available on Amazon EC2. You guys saved my bacon!

Chapter 7's fictional case study examining the Landsat 8 data for part of Greenland was based on one of Milos Miljkovic's tutorials from PyData 2014, "Analyzing Satellite Images With Python Scientific Stack". I hope readers have just as much fun working with satellite data as I did. Huge thanks to NASA, USGS, the Landsat 8 teams, and the EROS facility in Sioux Falls, SD.

My favourite section in Chapter 8 was the one on HDF5. This was greatly inspired by Yves Hilpisch's presentation "Out-of-Memory Data Analytics with Python". Many thanks to Yves for putting that together and sharing with the world. We should all be doing more with HDF5.

Finally, and this almost goes without saying, the work that the Python community has done to create Python 3 has been just phenomenal. Guido's vision for the evolution of the language, combined with the efforts of the community, have made something great. I had more fun working on Python 3 than I have had in many years.

by Duncan McGreggor (noreply@blogger.com) at August 10, 2017 04:12 AM

Itamar Turner-Trauring

Python decorators, the right way: the 4 audiences of programming languages

Python decorators are a useful but flawed language feature. Intended to make source code easier to write, and a little more readable, they neglect to address another use case: that of the programmer who will be calling the decorated code.

If you’re a Python programmer, the following post will show you why decorators exist, and how to compensate for their limitations. And even if you’re not a Python programmer, I hope to demonstrate the importance of keeping in mind all of the different audiences for the code you write.

Why decorators exist: authoring and reading code

A programming language needs to satisfy four different audiences:

  1. The computer which will run the source code.
  2. The author, the programmer writing the source code.
  3. A future reader of the source code.
  4. A future caller of the source code, a programmer who will write code that calls functions and classes in the source code.

Python decorators were created for authors and readers, but neglect the needs of callers. Let’s start by seeing what decorators are, and how they make it easier to author and read code.

Imagine you want to emulate the Java synchronized keyword: you want to run a method of a class with a lock held, so only one thread can call the method at a time. You can do so with the following code, where the synchronized function creates a new, replacement method that wraps the given one:

from threading import Lock

def synchronized(function):
    """
    Given a method, return a new method that acquires a
    lock, calls the given method, and then releases the
    lock.
    """
    def wrapper(self, *args, **kwargs):
        """A synchronized wrapper."""
        with self._lock:
            return function(self, *args, **kwargs)
    return wrapper

You can then use the synchronized utility like so:

class ExampleSynchronizedClass:
    def __init__(self):
        self._lock = Lock()
        self._items = []

    # Problematic usage:
    def add(self, item):
        """Add a new item."""
        self._items.append(item)
    add = synchronized(add)

As an author this usage is problematic: you need to type “add” twice, leading to a potential for typos. As a reader of the code you also only learn that add() is synchronized at the end, rather than the beginning. Python therefore provides the decorator syntax, which does the exact same thing as the above but more succinctly:

class ExampleSynchronizedClass:
    def __init__(self):
        self._lock = Lock()
        self._items = []

    # Nicer decorator usage:
    @synchronized
    def add(self, item):
        """Add a new item."""
        self._items.append(item)

Where decorators fail: calling code

The problem with decorators is that they fail to address the needs of programmers calling the decorated functions. As a user of ExampleSynchronizedClass you likely want your editor or IDE to show the docstring for add, and to detect the appropriate signature. Likewise if you’re writing documentation and want to automatically generate an API reference from the source code.

But in fact, what you get is the signature, name and docstring for the wrapper function:

>>> help(ExampleSynchronizedClass.add)
Help on method wrapper in module synchronized:

wrapper(self, *args, **kwargs) unbound synchronized.ExampleSynchronizedClass method
    A synchronized wrapper.

To solve this Python provides a utility decorator called functools.wraps, that copies attributes like name and docstring from the wrapped function. We change the implementation of the decorator:

from threading import Lock
from functools import wraps

def synchronized(function):
    """
    Given a method, return a new method that acquires a
    lock, calls the given method, and then releases the
    lock.
    """
    @wraps(function)
    def wrapper(self, *args, **kwargs):
        """A synchronized wrapper."""
        with self._lock:
            return function(self, *args, **kwargs)
    return wrapper

And now we get better help:

Help on method add in module synchronized:

add(self, item) unbound synchronized.ExampleSynchronizedClass method
    Add a new item.

In versions of Python older than 3.4 the signature will still be wrong: it’s still the signature of the wrapper, not the underlying function. If you want to support older versions of Python, one solution is to use a 3rd party library called wrapt. We redefine our decorator once more, this time using wrapt instead of functools.wraps:

import wrapt
from threading import Lock

@wrapt.decorator
def synchronized(function, self, args, kwargs):
    """
    Given a method, return a new method that acquires a
    lock, calls the given method, and then releases the
    lock.
    """
    with self._lock:
        return function(*args, **kwargs)

Beyond supporting older versions of Python, wrapt also has the benefit of being more succinct.

Addressing all audiences

While functools.wraps and wrapt do the trick, they still require you to remember to use them every time you define a new decorator. Arguably this is a failure in the Python language: it would’ve been more elegant to build the equivalent functionality into the @ syntax in the language itself, rather than relying on library code to fix it.

When you are writing a library, or perhaps even designing a programming language, it’s always worth keeping in mind that you need to support four distinct audiences: the computer, authors, readers and callers. And if you’re a Python programmer creating a decorator, do use wrapt: it’ll make your callers happier, and since it’s also more succinct it will also make life a little easier for your readers.

Updated: Noted Python 3.4 does do signatures, and tried to make the issue with the flaw more explicit. Thanks to Kevin Granger and hwayne for suggestions.

August 10, 2017 04:00 AM

August 08, 2017

Moshe Zadka

Python as a DSL

This is a joint post by Mark Williams and Moshe Zadka. You are probably reading it on one of our blogs -- if so, feel free to look at the other blog. We decided it would be fun to write a post together and see how it turns out. We definitely had fun writing it, and we hope you have fun reading it.

Introduction

A Domain Specific Language is a natural solution to many problems. However, creating a new language from whole cloth is both surprisingly hard and, more importantly, surprisingly hard to get right.

One needs to come up with a syntax that is easy to learn, easy to get right, hard to get wrong, and has the ability to give meaningful errors when incorrect input is given. One needs to carefully document the language, supplying at least a comprehensive reference, a tutorial, and a best practices guide all with examples.

On top of this, one needs to write a toolchain for the language that is as high quality as the one users are used to from other languages.

All of this raises a tempting question: can we use an existing language? In this manner, many languages have been used, or abused, as domain specific languages -- Lisp variants (such as Scheme) were among the first to be drafted, but were quickly followed by languages like TCL, Lua, and Ruby.

Python, being popular in quite a few niches, has also been used as a choice for things related to those niches -- the configuration format for Jupyter, the website structure specification in Pyramid, the build directives for SCons, and the target specification for Pants.

In this post, we will show examples of Python as a Domain Specific Language (or DSL) and explain how to do it well -- and how to avoid doing it badly.

As programmers we use a variety of languages to solve problems. Usually these are "general purpose" languages, or languages whose design allows them to solve many kinds of problems equally well. Python certainly fits this description. People use it to solve problems in astronomy and biology, to answer questions about data sets large and small, and to build games, websites, and DNS servers.

Python programmers know how much value there is in generality. But sometimes that generality makes solving a problem tedious or otherwise difficult. Sometimes, a problem or class of problems requires so much set up, or has so many twists and turns, that its obvious solution in a general purpose language becomes complicated and hard to understand.

Domain specific languages are languages that are tailored to solve specific problems. They contain special constructions, syntax, or other affordances that organize patterns common to the problems they solve.

Emacs Lisp, or Elisp, is a Domain Specific Language focused on text editing. Emacs users can teach Emacs to do novel things by extending the editor with Elisp.

Here's an example of an Elisp function that swaps ' with " and vice-versa when the cursor is inside a Python string:

(defun python-swap-quotes ()
  "Swap single and double quotes."
  (interactive)
  (save-excursion
    (let ((state (syntax-ppss)))
      (when (eq 'string (syntax-ppss-context state))
        (let* ((left (nth 8 state))
               (right (1- (scan-sexps left 1)))
               (newquote (if (= ?' (char-after left))
                             ?\" ?')))
          (dolist (loc (list left right))
            (goto-char loc)
            (delete-char 1)
            (insert-char newquote 1)))))))

This is clearly Lisp code, and parts of it, such as defining a function with defun or variables with let, is not specific to text editing or even Emacs.

(interactive), however, is a special extension to Elisp that makes the function that encloses it something a user can assign to a keyboard shortcut or select from a menu inside Emacs. Similarly, (save-excursion ...) ensures that the file the user is editing and the location of the cursor are restored after the code inside is run. This allows the function to jump around within a file or even multiple files without disturbing a user's place.

Lots of Elisp code makes use of special extensions, but Python programmers don't complain about their absence, because they're of no use outside Emacs. That specialization makes Elisp a DSL.

The language of Dockerfiles is also a domain specific language. Here's a simple hello world Dockerfile:

FROM scratch
COPY hello /
ENTRYPOINT ["/hello"]

The word that begins each line instructs Docker to perform some action on the arguments that follow, such as copying the file hello from the current directory into the image's root directory. Some of these commands have meaning specifically to Docker, such as the FROM command to underlay the image being built with a base image.

Note that unlike Elisp, Dockerfiles are not Turing complete, but both are DSLs. Domain specificity is distinct from mathematical concepts like decidability. It's a term we use to describe how specialized a language is to its problem domain, not a theoretical Computer Science term.

Code written in a domain specific language should be clearer and easier to understand because the language focuses on the domain, while the programmer focuses on the specific problem.

The Elisp code won't win any awards for elegance or robustness, but it benefits from the brevity of (interactive) and (save-excursion ...). Most of the function consists of the querying and computation necessary to find and rewrite Python string literals. Similarly, the Dockerfile doesn't waste the reader's attention on irrelevant details, like how a base image is found and associated with the current image. These DSLs keep their programs focused on their problem domains, making them easier to understand and extend.

Naive Usage of Python as a DSL

Programmers describe things that hide complexity behind a dubiously simple facade as magic. For some reason, when the idea of using Python as a DSL first comes up, many projects choose the strategy we will call "magical execution context". It is more common in projects written in C/C++ which embed Python, but happens quite a bit in pure-Python projects.

The prototypical code that creates a magical execution context might look something like:

namespace = dict(DomainFunction1=my_domain_function1,
                 DomainFunction2=my_domain_function2)
with open('Domainspecificfile') as fp:
    source = fp.read()
exec(source, namespace)
do_something_with(namespace['special_name'])

Real-life DSLs usually have more names in their magical execution contexts (often ranging in the tens or sometimes hundreds), and DSL runtimes often have more complicated logic around finding the files they parse. However, this platonic example is useful to keep in mind when reading through the concrete examples.

While various other projects were automatable with Python, SCons might be the oldest surviving project where Python is used as a configuration language. It also happens to be implemented in Python -- but aside from making the choice of Python as a DSL easier to implement, it has no bearing on our discussion today.

An SCons file might look like this:

src_files = Split("""main.c
                     file1.c
                     file2.c""")
Program('program', src_files)

Code can also be imported from other files:

SConscript(['drivers/SConscript',
            'parser/SConscript',
            'utilities/SConscript'])

Note that it is not possible, via this method, to reuse any logic other than build settings across the files -- a function defined in one of them is not available anywhere else.

At 12 years old, Django is another venerable Python project, and like the similarly venerable Ruby on Rails, it's no stranger to magic. Once upon a time, Django's database interaction APIs were magical enough that they constituted a kind of domain-specific language with a magical execution context.

Like modern Django, you would define your models by subclassing a special class, but unlike modern Django, they were more than just plain old Python classes.

A Django application in a module named best_sellers.py might have had a model that looked like this:

from django.core import meta

class Book(meta.Model):
      name = meta.CharField(maxlength=70)
      author = meta.CharField(maxlength=70)
      sold = meta.IntegerField()
      release_date = meta.DateTimeField(default=meta.LazyDate())

      def get_best_selling_authors(self):
          cursor = db.cursor()
          cursor.execute("""
          SELECT author FROM books WHERE release_date > '%s'
          GROUP BY author ORDER BY sold DESC
          """ % (db.quote(datetime.datetime.now() - datetime.timedelta(weeks=1)),))
          return [row[0] for row in cursor.fetchall()]

      def __repr__(self):
          return self.full_name

A user would then use it like so:

from django.models.best_sellers import books
print books.get_best_selling_authors()

Django transplanted the Book model into its own magic models module and renamed it books. Note the subtle transformation in the midst of more obvious magic: the Book model was lowercased and automatically pluralized.

Two magic globals were injected into the model's instance methods: db, the current database connection, and datetime, the Python standard library module. That's why our example module doesn't have to import them.

The intent was to reduce boilerplate by exploiting Python's dynamism. The result, however, diverged from Python's expected behaviors and also invented new, idiosyncratic boilerplate; in particular, the injection of special globals prevented methods from accessing variables defined in their source modules, so methods had to directly import any module they used, forcing programmers to repeat themselves.

Django's developers came to see these features as "warts" and removed them before the 0.95 release. It's safe to say that the "magic-removal" process succeeded in improving Django's usability.

Python has well-documented built-ins. People who read Python code are, usually, familiar with those. Any symbol which is not a built-in or a reserved word is imported.

Any DSL will have its own, extra built-ins. Ideally, those are well documented -- but even when they are, this is a source of documentation separate from the host language. This code can never be used by anything outside the DSL. A good example of such potential usage is unit testing the code. Once a DSL catches on, it often inspires creation of vast amounts of code. The example of Elisp is particularly telling.

Another problem with such code is that it's often not obvious what the code itself is allowed to import or call. Is it safe to do long-running operations? If logging to a file, is logging properly set-up? Will the code double log messages, or does it cache the first time it is used? As an example, there are a number of questions about how to share code between SCons on StackOverflow, with explanations about the trade-offs between using an SConscript file or using Python modules and import.

Last, but not least, other Python code often implicitly assumes that functions and classes are defined in modules. This means that either it is ill-advised to write such code in the DSL -- perhaps defining classes might lead to a memory leak because the contents are used in exec multiple times -- or, worse, that random functionality will break. For example, do tracebacks work correctly? Does pickle?

A New Hope

As seen from the examples of SCons and old, magical Django, naively using Python as a DSL is problematic. It gives up a lot of the benefits of using a pre-existing language, and results in something that is in the Python uncanny valley -- just close enough to Python that the distinctions result in a sense of horror, not cuteness.

One way to avoid the uncanny valley is to step further away and avoid confusion -- implement a little language using PyParsing that is nothing like Python. But valleys have two sides. We can solve the problem by just using pure, unadulterated Python. It turns out that removing an import statement at the top of the file does not reduce much overhead when specializing to a domain.

We explore, by example, good ways to use Python as a DSL. We start by showing how even a well-written module, taking advantage of some of the power of Python, can create a de-facto DSL. Taking it to the next level, frameworks (which call user code) can also be used to build DSLs in Python. Most powerfully, especially combined with libraries and frameworks, Python plugin systems can be used to avoid even the need for a user-controlled entry point, allowing DSLs which can be invoked from an arbitrary program.

Python is a flexible language and subtle use of its features can result in a flexible DSL that still looks like Python.

We explore four examples of such DSLs -- NumPy, Stan, Django ORM, and Pyramid.

NumPy

NumPy has the advantage of having been there since the dawn of Python, being preceded by the Numeric library, on which it was based. Using that long lineage, it has managed to exert some influence on adding some things to Python core's syntax -- the Ellipsis type and the @ operator.

Taking advantage of both those, as well as combinations of things that already exist in Python, NumPy is basically a DSL for performing multi-dimensional calculations.

As an example,

x[4,...,5,:]

lowers the dimension of x by 2, killing the first and next-to-last dimension. How does it work? We can explore what happens using this proof-of-concept:

class ItemGetterer(object):
    def __getitem__(self, idx):
        return idx

x = ItemGetterer()
print(x[4,...,5,:])

This prints (4, Ellipsis, 5, slice(None, None, None)).

In NumPy, the __getitem__ method expects tuples, and will parse them for numbers, the Ellipsis object and slice objects -- and then apply them to the array being indexed.

In addition, overriding the methods corresponding to the arithmetic operators, known as operator overloading, allows users of NumPy to write code that looks like the corresponding math expression.

Stan

Stan is a way to produce XML documents using pure Python syntax. This is often useful in web frameworks, which need to produce HTML.

For illustration, here is an example stan-based program:

from nevow import flat, tags, stan

video = stan.Tag('video')

aDocument = tags.html[
                tags.head[
                    tags.title["Title"]
                ],
                tags.body[
                    tags.h1["Heading" ],
                    tags.p(class_="life")["A paragraph about life."],
                    video["Your video here!"],
                ]
            ]
with open('output.html', 'w') as fp:
    fp.write(flat.flatten(aDocument))

The tags module has a few popular tags. Those are instances of the stan.Tag class. If a new tag is needed, for example the <video> tag above, one can be added locally.

This is completely valid Python, without any magical execution contexts, in a regular importable module -- which allows easy generation of HTML.

As an example of the advantages of making this a regular Python execution context, we can see the benefits of dynamically generating HTML:

from nevow import flat, tags
bullets = [tags.li["bullet {}".format(i)] for i in range(10)]
aDocument = tags.html[
                tags.body[
                    tags.ul[bullets]
                ]
            ]
with open('output.html', 'w') as fp:
    fp.write(flat.flatten(aDocument))

In more realistic scenarios, this would be based on a database call, or a call to some microservice. Because stan is just pure Python code, it is easy to integrate into whatever framework expects it -- it can be returned from a function, or set as an object attribute.

The line between "taking advantage of Python syntax and magic method overriding" and "abusing Python syntax" is sometimes subtle and always at least partially subjective. However, Python does allow surprising flexibility when it comes to using pieces of the syntax for new purposes.

This gives great powers to mere library authors, without any need for esoterica such as pushing and pulling variables into dictionaries before or after execing code. The with keyword, which we have not covered here, also often comes in handy for building DSLs in Python which do not need magic to work.

Django ORM

Operator overloading is one way Python allows programmers to imbue existing syntax with new, domain-specific semantics. When those semantics describe data with a repeated structure, Python's class system provides a natural model, and metaclasses allow you to extend that model to suit your purpose. This makes them a power tool for implementing Python DSLs.

Object-relational mapping (ORM) libraries often use metaclasses to ease defining and querying database tables. Django's Model class is the canonical example. Note that the API we're about to describe is part of modern, post-magic-removal Django!

Consider the models defined in Django's tutorial:

from django.db import models


class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')


class Choice(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    choice_text = models.CharField(max_length=200)
    votes = models.IntegerField(default=0)

Each class encapsulates knowledge about and actions on a database table. The class attributes map to columns and inter-table relationships which power data manipulation and from which Django derives migrations. Django's models turn classes into a domain-specific language for database definitions and logic.

Here's what the generated DDL might look like:

--
-- Create model Choice
--
CREATE TABLE "polls_choice" (
    "id" serial NOT NULL PRIMARY KEY,
    "choice_text" varchar(200) NOT NULL,
    "votes" integer NOT NULL
);
--
-- Create model Question
--
CREATE TABLE "polls_question" (
    "id" serial NOT NULL PRIMARY KEY,
    "question_text" varchar(200) NOT NULL,
    "pub_date" timestamp with time zone NOT NULL
);

A metaclass plays a critical role in this DSL by instrumenting Model subclasses. It's this metaclass that adds the objects class attribute, a Manager instance that mediates ORM queries, and the class-specific DoesNotExist and MultipleObjectsReturned exceptions.

Because metaclasses control class creation, they're an obvious way to inject these kinds of class-level attributes. For the same reason, but less obviously, they also provide a place to run initialization hooks that should run only once in a program's lifetime. Classes are generally defined at module level. Thus, classes are created when their modules are first imported. Because of Python's module caching, this means that metaclasses are usually run early and rarely. Django's DSL makes use of this assumption to register models with their applications upon creation.

Running code this soon can lead to strange issues, which make it tricky to use metaclasses correctly. They also rely on subclassing, which is considered harmful. These things and their use in ORMs, which are also considered harmful, might seem to limit their usefulness. However, a base class whose purpose is to inject a metaclass avoids many of the problems associated with subclassing, as little to no functionality will be inherited. Django weighs the benefits of familiar syntax against the costs of subclassing, resulting in a data definition DSL that's ergonomic for Python programmers.

Despite their complexity and shortcomings, metaclasses provide a succinct way to describe and manipulate all kinds of data, from wire protocols to XML documents. They can be just the trick for data-focused DSLs.

Pyramid

Pyramid allows defining web application logic, as opposed to the routing details, anywhere. It will match up the function to the route based on the route name, as defined in the routing configuration.

# Removed imports

## The function definition can go anywhere
@view_config(route_name='home')
def my_home(context, request):
    return 'OK'

## This goes in whatever file we pass to our WSGI host
config = Configurator()
config.add_route('home', '/')
config.scan('.')
app = config.make_wsgi_app()

The builder pattern, as seen here, allows gradually creating an application. The methods on Configurator, as well as the decorators such as view_config, are effectively a DSL that helps build web applications.

Plugins

When code lives in real Python modules, and uses real Python APIs, it is sometimes useful for it to be executed automatically based on context. After all, one thing that DSL systems like SCons give us is automatically executing the SConscript when we run scons at the command line.

One tool that can be used for this is a plugin system. While a comprehensive review of plugin systems is beyond our scope here, we will give a few examples of using such systems for specific domains.

One of the oldest plugin systems is twisted.plugin. While it can be used as a generic plugin system, the main usage of it -- and a good case study of using it as a plugin system for DSLs -- is to extend the twist command line. These are also known as tap plugins, for historical reasons.

Here is a minimal example of a Twisted tap plugin:

# Removed imports
@implementer(IServiceMaker, IPlugin)
class SimpleServiceMaker(object):
    tapname = "simple-dsl"
    description = "The Simplest DSLest Plugin"

    class options(usage.Options):
        optParameters = [["port", "p", 1235, "Port number."]]

    def makeService(self, options):
        return internet.TCPServer(int(options["port"]),
                                  Factory.forProtocol(Echo))

serviceMaker = SimpleServiceMaker()

In order to be a valid plugin, this file must be placed under twisted.plugins. The usage.Options class defines a DSL, of sorts, for describing command-line options. We used only a small part of it here, but it is both powerful and flexible.

Note that this is completely valid Python code -- in fact, the plugin machinery will import it as a module. This means we can import it ourselves, and test it with unit tests, as sketched below.
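
A sketch of such a test, assuming the plugin module above is importable as simple_plugin:

from twisted.application import internet
from twisted.trial import unittest

from simple_plugin import SimpleServiceMaker

class SimpleServiceMakerTests(unittest.TestCase):
    def test_makeService(self):
        # Parse options the same way twist would, then build the service.
        maker = SimpleServiceMaker()
        options = maker.options()
        options.parseOptions(["--port", "8080"])
        service = maker.makeService(options)
        self.assertIsInstance(service, internet.TCPServer)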

In fact, because this is regular Python code, usually serviceMakers are created using a helper class -- twisted.application.service.ServiceMaker. The definition above, while correct, is not idiomatic.
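
The idiomatic equivalent looks something like this -- a sketch; the dotted module path is invented, and the named module is expected to define Options and makeService:

from twisted.application.service import ServiceMaker

serviceMaker = ServiceMaker(
    "Simple DSL",                  # name
    "simpledsl.tap",               # module, loaded lazily
    "The Simplest DSLest Plugin",  # description
    "simple-dsl")                  # tapname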

The gather library does not have a DSL. It does, however, function well as an agnostic plugin discovery mechanism. Because of that, it can be built into other systems -- that do provide a Pythonic DSL -- to serve as the autodiscovery mechanism.

# In a central module:
import gather

ITS_A_DSLISH_FUNCTION = gather.Collector()

## Define the DSL as
## -- functions that get two parameters
##    -- conf holds some general configuration
##    -- send_result is used to register the result
def run_the_function_named(name, conf, send_result):
    res = ITS_A_DSLISH_FUNCTION.collect()
    return res[name](conf, send_result)

# In a module registering a DSL function
@ITS_A_DSLISH_FUNCTION.register(name='my_dslish_name')
def some_func(conf, send_result):
    with conf.temp_stuff() as some_thing:
        send_result(some_thing.get_value())

Conclusion

Python is a good language to use for DSLs. So good, in fact, that attrs, a DSL for defining classes, has achieved enormous popularity. Operator overloading, decorators, the with statement and generators, among other things, combine to allow novel usage of the syntax in specific problem domains. The existence of a big body of documentation of the language and its best practices, along with a thriving community of practitioners, is also an asset.

In order to take advantage of all of those, it is important to use Python as Python -- avoid magical execution contexts and novel input search algorithms in favor of the powerful code organization model Python already has -- modules.

Most people who want to use Python as a DSL are also Python programmers. Consider allowing your program's users to use the same tools that have made you successful.

As Glyph said in a related discussion, "do you want to confuse, surprise, and annoy people who may be familiar with Python from elsewhere?" Assuming the answer is "no", consider using real modules as your DSL mechanism.

by Moshe Zadka and Mark Williams at August 08, 2017 04:30 AM

August 07, 2017

Itamar Turner-Trauring

Can we please do useful things with software?

If you want to read something that goes from depressing to exciting and back again every other paragraph, read the monthly Who’s Hiring thread on Hacker News.

When I was younger I didn't think much about what software I was writing: I wanted to work on interesting problems, and get paid for doing it. So at one point I was offered a job at a financial trading platform, a job I would never take today; luckily, in the end, I walked away from that offer.

These days technical problems are still just as fun, but they’re no longer sufficient: I want to do something useful, something that makes the world a tiny bit better. And so it’s sad to see some of the software we programmers are spending our time writing, and exciting to see the useful ways in which software is being applied.

Less of this, please

If you need a job, you need a job, and as long as you’re not doing something you consider unethical or immoral you do what you need to to get by. But if you have the opportunity, why not also do something useful, something that makes the world better?

Adtech? I mean, yes, advertising is kinda sorta maybe useful, if you squint, but at this point I find browsing without an ad blocker positively unpleasant. I miss the days when Google ad results were actually helpful.

Do we need to spend more time making it easy for brands to do anything? Can brands do anything? Does Coca-Cola have a giant glowing disembodied Coke avatar, ensconced deep within the bowels of Coca-Cola Worldwide Headquarters, sending out red ectomorphic tentacles to type out text into a SaaS written by programmers passionate about user engagement? I’d love to sit in on some of your customer interviews, if so, or perhaps just watch a recording from a safe distance.

While we’re at it, can we stop being passionate?

Cryptocurrencies? Do you want to be responsible when the bubble bursts and it turns out capital flight from China is not a good basis for a currency? With tulips at least you had flowers at the end, even if they were bad investments; with cryptocurrencies people will be left with some digits on a USB drive. Woo.

Is it actually necessary to take an existing financial product (annuities, let’s say) and call them something else (pensions, just for example)? Annuities and pensions have very different risk profiles; the former has individual company risk that the latter doesn’t. Is this really worthwhile innovation?

Having previously lived in a different country I do realize the American medical system is a total and utter fuckup. But couldn’t we just switch to single-payer like every other developed country, instead of writing software to put bandaids on a chest wound?

And the world probably doesn't need another startup whose business model involves taking VC money and giving it to poorly paid contractors in order to make the lives of the upper middle class a minuscule increment more comfortable. How about a business model involving paying good wages to do something more valuable?

Do something useful

My definition of usefulness is personal and idiosyncratic, of course; I expect you will disagree with at least some of the list above. But there are also plenty of companies that sound like they're building something almost anyone would find worthwhile.

“Technology to investigate pressure transients and flow instabilities in water supply networks”? May your hiring pipeline always be full.

“Reducing paperwork”? Sign me up (as long as the form is short).

Next time you’re looking for job, spend a little time upfront thinking about what you think makes a company useful. Interesting technical problems are great, getting paid well is pretty damn good, and a short commute is a joy. But working on something that makes the world a better place will make your own job that much better.

August 07, 2017 04:00 AM

August 03, 2017

Itamar Turner-Trauring

Staying focused: it's not just your environment

To be a productive programmer you need to stay focused. Deep-diving into TV Tropes, chatting with your friends, or reading up on that fancy new web framework might be fun, often even educational, but they won't get that feature you're working on out the door.

And there are harder-to-spot distractions, digressions masquerading as necessary work: a fun bug that is less important than the one you're working on, a technical detail that doesn't really matter, a task that can be put off until later. In a world full of distractions, how can you stay focused?

One obvious influence on your ability to focus is your environment. Is it noisy or quiet? Are you constantly interrupted, or do you get time to yourself? But whatever environment is best for you, even working in it may not suffice: you can still suffer from distraction and lack of focus.

If you want to stay focused you will need, beyond a good environment:

  1. The motivation to do your work, which requires you to understand both yourself and your task.
  2. Coping techniques to help you deal with the fact that focus is a finite resource.

Motivation: why are you doing this?

If you don’t care about your task, then you’ll have a hard time focusing. But once you do understand why you’re doing what you do, you’ll have an easier time staying on task, and you’ll have an easier time distinguishing between necessary subtasks and distracting digressions.

Why are you doing what you’re doing at work? In part, there are general motivations that apply to all your work on the job. For example:

  • Money: Getting paid so you can buy food and shelter.
  • Social pressure: You want your coworkers and boss to think well of you.

The problem with these motivations is that they are extrinsic: they come from the outside. Intrinsic motivations tend to work better. For example:

  • A sense of obligation: You want to help your customers or users.
  • Building and playing: Solving a hard problem is fun.
  • Curiosity: Learning is fun too.

These general motivations will not suffice, however, if you don’t understand why you’re doing a particular task. Why does this data need to be collected? Why do you need to debug this seemingly impossible edge case; does it really matter?

Applying motivation: will this further your goal?

So how do you use motivation to stay focused?

  1. Figure out the motivations for your task.
  2. Strengthen your motivation.
  3. Judge each part of your work based on your motivations.

1. Discovering your motivations

Start with the big picture: why are you working this job? Probably for the money, hopefully because you believe in the organization’s goal, and perhaps for other reasons as well.

Then focus down on your particular task: why is it necessary? It may be that to answer this question you'll need to do more research, talking to the product owner who requested a feature, or the user who reported a bug. This research will, as an added bonus, also help you solve the problem more effectively.

Combine all of these and you will get a list of motivations that applies to your particular task. For example, let’s say you’re working on a bug in a flight search engine. Your motivations might be:

  1. Money: I work to make money.
  2. Organizational goal: I work here because I think helping people find cheap, convenient flights is worth doing.
  3. Task goal: This bug should be fixed because it prevents users from finding the most convenient flight on certain popular routes.
  4. Fun: This bug involves a challenging C++ problem I enjoy debugging.

2. Strengthening your motivations

Keeping your motivations in mind will help you avoid distractions, and the stronger your motivations the better you’ll do. If your motivations are weak then you can try different solutions:

  • If you work for a company whose goals don’t mean much to you, then you’ll have a harder time focusing: consider finding a new job where you’re doing something you care more about.
  • If after enough research you’ve decided your task is pointless, you can either try to push back (mark the bug as WONTFIX, go talk to the product manager), try to add an additional motivation (is this a good opportunity to learn something new?), or just live with the fact that it’ll take you longer to implement.

3. Judging your work

As you go about solving your task you can use your motivations to judge whether a new potential subtask is worth doing. That is, your motivations can help prevent digressions, seemingly useful tasks that shouldn’t actually be worked on.

Going back to the example above, imagine you encounter some interesting C++ language feature while working on the bug: it can be tempting to dive in. But judged by the four motivations it will only serve the fourth, having fun, and likely won't further your other goals. So if the bug is urgent, you should probably wait until it's fixed to play around.

On the other hand, if you’re working on a pointless feature, your sole motivation might be “keep my manager happy so I can keep getting paid.” If you have two days to do the task, and it’ll only take two hours to implement it, spending some time getting “distracted” learning a technical skill might help with a different motivation: switching to a more interesting position or job.

Coping with lack of focus

Even if you have an ideal environment and plenty of motivation, you will eventually run out of focus. This happens in two different dimensions:

  1. Time: Many programming tasks will take days or weeks to complete, and won't fit in the limited window of time during which you can stay focused.
  2. Space: There’s only so much code you can keep in your head at once, and most software projects will quickly exceed your limits. That means you can only focus on part of the code at a time.

You can work around these limitations using a variety of coping techniques:

  • Breaking up larger tasks into smaller tasks: Smaller tasks limit what you need to keep in your head, and can be finished more quickly.
  • Abstractions: Good abstraction boundaries reduce how much you need to keep in your head at a time, and allow you to finish your task more quickly.

Another coping technique I don’t see used quite as often is writing everything down.

Write everything down

You’re working on a hard bug: you’re not sure what’s going on or why the problem occurs, and when you do figure it out it’s going to take a few days to implement. Along the way you will be interrupted by scheduled meetings, coworkers asking questions, your bladder, email, going home for the evening, a weekend vacation, two quick bugs, and a few hundred other distractions. Write everything down and distractions and interruptions will matter far less.

You start by trying out different hypotheses: maybe the bug is in this function, perhaps it’s in the environment, maybe it’s a difference in library versions… Write down all your hypotheses. That way when you get interrupted you won’t forget about them.

You try one hypothesis, and it turns out to be wrong. Write that down so you don’t forget and test it again. Eventually you figure out the real problem: write that down too. That way when you come in the next day you’ll remember what you learned.

Discover another bug along the way? Write that down by filing a ticket, and move on. Have an idea for a feature? Write that down too.

Next you come up with a list of subtasks to actually implement the fix, and then write them down, marking them off as you implement them. You’ll be grateful to your past self when you come back from the weekend and try to remember where you were.

In short: write everything down.

How to stay focused

To stay focused you need to:

  • Work in the best environment you can manage: minimal distractions, appropriate levels of noise, and so on.
  • Understand your motivations, both in general and as applied to this task.
  • Try to strengthen your motivations, by choosing meaningful or interesting work.
  • Judge your work based on how it helps achieve your motivations.
  • Cope with lack of focus by breaking up tasks, using and building abstractions, and writing everything down.

PS: Want to learn more software engineering skills and techniques? I write a weekly email covering one of my mistakes and what you can learn from it.

August 03, 2017 04:00 AM

July 26, 2017

Moshe Zadka

Image Editing with Jupyter

With the news about MS Paint going away from the default MS install, it might be timely to look at other ways to edit images. The most common edit I need to do is to crop images -- and this is what we will use as an example.

My favorite image editing tool is Jupyter. Jupyter needs some encouragement to be an image editor -- and to easily open images. As is often the case, I have a non-pedagogical, but useful, preamble. The preamble turns Jupyter into an image editor.

from matplotlib.pyplot import imshow
import numpy
import PIL.Image
import os

%matplotlib inline

def inline(some_image):
    imshow(numpy.asarray(some_image))

def open(file_name):
    return PIL.Image.open(os.path.expanduser(file_name))

With the boring part done, it is time to edit some images! At the Shopkick birthday party, I had my caricature drawn. I love it -- but it carries a lot of baggage about the birthday party, which is irrelevant for uploading to Facebook.

I have downloaded the image from the blog. I use Pillow (the packaging fork of PIL) to open the image.

a = open("~/Downloads/weeeee.jpg")

Then I want to visually inspect the image inline:

inline(a)

I use the crop method, and directly inline it:

inline(a.crop((0, 0, 1500, 1600)))

If this were longer, and more realistic, I would be playing with the numbers back and forth -- and maybe resizing, or combining the image with others, as sketched below.
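
For instance, one such iteration might look like this -- a sketch; the numbers and the blank canvas are invented for illustration:

cropped = a.crop((0, 0, 1500, 1600))
smaller = cropped.resize((750, 800))
inline(smaller)

# Combine with another image: paste the smaller version onto a canvas.
canvas = PIL.Image.new("RGB", (1000, 800), "white")
canvas.paste(smaller, (125, 0))
inline(canvas)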

The Pillow library is great, and this way we can inspect the results as we are modifying the image, allowing iterative image editing. For people like me, without a strong steady artist's hand to perfectly select the right circle, this solution works just great!

by Moshe Zadka at July 26, 2017 05:20 AM

July 21, 2017

Itamar Turner-Trauring

Incremental results, not incremental implementation

Update: Added section on iterative development.

You’re working on a large project, bigger than you’ve ever worked on before: how do you ship it on time? How do you ensure it has all the functionality it needs? How do you design something that is too big to fit in your head?

My colleague Rafi Schloming, speaking in the context of the transition to microservices, suggests that focusing on incremental results is fundamentally better than focusing on incremental implementation. This advice will serve you well in most large projects, and to explain why I’d like to tell you the story of a software project I built the wrong way.

A real world example

The wrong way…

I once built a system for efficiently sending streams of data from one source to many servers; the resulting software was run by the company’s ops team. Since I was even more foolish than I am now, I implemented it in the following order, based on the architecture I had come up with:

  1. First I implemented a C++ integration layer for the Python networking framework I was using, so I could write higher performance code.
  2. Then I implemented the messaging protocol and system, based on a research paper I’d found.
  3. Finally, I handed the code over to ops.

As you can see, I implemented my project based on its architecture: first the bottom layer, then the layers that built on top of it. Unfortunately, since I hadn’t consulted ops enough about the design they then had to make some changes on their own. As a result, it took six months to a year until the code was actually being used in production.

…and the right way

How would I have built my tool to deliver incremental results?

  1. Build a working tool in pure Python. This would probably have been too slow for some of the higher-speed message streams.
  2. Hand initial tool over to ops. Ops could then start using it for slower streams, and provide feedback on the design.
  3. Next I would have fixed any problems reported by ops.
  4. Finally, I would rewrite the core networking in C++ for performance.

Notice that this is seemingly less efficient than my original plan, since it involves re-implementing some code. Nonetheless I believe it would have resulted in the project going live much sooner.

Why incremental results are better

Incremental results means you focus on getting results as quickly as possible, even if you can’t get all the desired results with initial versions. That means:

  • Faster feedback: You can start using the software earlier, and therefore get feedback earlier. In my case I would have learned about ops’ use cases and problems months earlier, and could have incorporated their suggestions into my design. Instead, they had to patch the code themselves.
  • Fewer unnecessary features: By focusing on results you’re less likely to get distracted by things you think you need. I believed that Python wouldn’t be fast enough, so I spent a lot of time upfront on C++. And the C++ version was definitely quite fast, faster than pure Python could manage. But maybe a Python version would have been fast enough.
  • Less cancellation risk: The faster you deliver a project, the faster you can demonstrate results, and so the less risk of your big project being canceled half-way.
  • Less deployment risk: Instead of turning on a single, giant deliverable, you will start by deploying a simpler version, and then upgrading it over time. That means more operational knowledge of the software, and less risk when you turn it on the first time.

Beyond iterative development

“Iterative development” is a common, and good, suggestion for software development, but it’s not quite the same as focusing on incremental results. In iterative development you build your full application end-to-end, and then in each released iteration you make the functionality work better. In that sense, the better alternative I was suggesting above could be seen as simply suggesting iterative development. But incremental results is a more broadly applicable idea than iterative development.

Incremental results are the goal; iterative development is one possible technique to achieve that goal. Sometimes you can achieve incremental results without iterative development:

  • If each individual feature provides value on its own then you can get incremental results with less iterative, more cumulative development. That is, you don’t need to start with end-to-end product and then flesh out the details, you can just deliver one feature at a time.
  • Incremental results aren’t just about your development process, they are also about how results are delivered. For example, if you’re streaming a website to a browser there are two ways to send images: from top to bottom, or starting with a blurry image and getting progressively sharper. With a fast connection either choice works. With a slow connection progressive sharpening is superior because it provides information much sooner: incremental results.

Whenever you can, aim for incremental results: it will reduce the risks, and make your project valuable much earlier. It may mean some wasted effort, yes, as you re-implement certain features, but that waste is usually outweighed by the reduced risk and faster feedback you’ll get from incremental results.

PS: I’ve made lots of other mistakes in my career. If you’d like to learn how to avoid them, sign up for my newsletter, where every week I write up one of my mistakes and how you can avoid it.

July 21, 2017 04:00 AM

July 20, 2017

Moshe Zadka

Anatomy of a Multi-Stage Docker Build

Docker, in recent versions, has introduced multi-stage builds. This allows separating the build environment from the runtime environment much more easily than before.

In order to demonstrate this, we will write a minimal Flask app and run it with Twisted using its WSGI support.

The Flask application itself is the smallest demo app, straight from any number of Flask tutorials:

# src/msbdemo/wsgi.py
from flask import Flask
app = Flask("msbdemo")
@app.route("/")
def hello():
    return "If you are seeing this, the multi-stage build succeeded"

The setup.py file, similarly, is the minimal one from any number of Python packaging tutorials:

import setuptools
setuptools.setup(
    name='msbdemo',
    version='0.0.1',
    url='https://github.com/moshez/msbdemo',
    author='Moshe Zadka',
    author_email='zadka.moshe@gmail.com',
    packages=setuptools.find_packages(),
    install_requires=['flask'],
)

The interesting stuff is in the Dockerfile. It is interesting enough that we will go through it line by line:

FROM python:2.7.13

We start from a "fat" Python docker image -- one with the Python headers installed, and the ability to compile extensions.

RUN virtualenv /buildenv

We create a custom virtual environment for the build process.

RUN /buildenv/bin/pip install pex wheel

We install the build tools -- in this case, wheel, which will let us build wheels, and pex, which will let us build single file executables.

RUN mkdir /wheels

We create a custom directory to put all of our wheels. Note that we will not install those wheels in this docker image.

COPY src /src

We copy our minimal Flask-based application's source code into the docker image.

RUN /buildenv/bin/pip wheel --no-binary :all: \
                            twisted /src \
                            --wheel-dir /wheels

We build the wheels, taking care to build all of them from source ourselves (--no-binary :all:), since pex, right now, cannot handle manylinux binary wheels.

RUN /buildenv/bin/pex --find-links /wheels --no-index \
                      twisted msbdemo -o /mnt/src/twist.pex -m twisted

We bundle the twisted and msbdemo wheels, together with any recursive dependencies, into a Pex file -- a single-file executable.

FROM python:2.7.13-slim

This is where the magic happens. A second FROM line starts a new docker image build. The previous images are available -- but only inside this Dockerfile -- for copying files from. Luckily, we have a file ready to copy: the output of the Pex build process.

COPY --from=0 /mnt/src/twist.pex /root

The --from=0 indicates copying from a previously built image, rather than the so-called "build context". In theory, any number of builds can take place in one Dockerfile. While only the last one will actually result in a permanent image, the others are all available as targets for --from copying. In practice, two stages are usually enough.

ENTRYPOINT ["/root/twist.pex", "web", "--wsgi", "msbdemo.wsgi.app", \
            "--port", "tcp:80"]

Finally, we use Twisted as our WSGI container. Since we bound the Pex file to the equivalent of python -m twisted, all we need to do is run the web plugin, ask it to run a WSGI container, and give it the logical (module) path to our WSGI app.

Using Docker multi-stage builds has allowed us to create a Docker container for production with:

  • A smaller footprint (using the "slim" image as base)
  • Few layers (only adding two layers to the base slim image)

The biggest benefit is that it let us do so with one Dockerfile, with no extra machinery.

by Moshe Zadka at July 20, 2017 04:30 AM

July 18, 2017

Glyph Lefkowitz

Beyond ThunderDock

This weekend I found myself pleased to receive a Kensington SD5000T Thunderbolt 3 Docking Station.

Some of its functionality was a bit of a weird surprise.

The Setup

Due to my ... accretive history with computer purchases, I have 3 things on my desk at home: a USB-C macbook pro, a 27" Thunderbolt iMac, and an older 27" Dell display, which is old enough at this point that I can’t link it to you. Please do not take this to be some kind of totally sweet setup. It would just be somewhat pointlessly expensive to replace this jumble with something nicer. I purchased the dock because I want to have one cable to connect me to power & both displays.

For those not familiar, iMacs of a certain vintage [1] can be jury-rigged to behave as Thunderbolt displays with limited functionality (no access from the guest system to the iMac’s ethernet port, for example), using Target Display Mode, which extends their useful lifespan somewhat. (This machine is still, relatively speaking, a powerhouse, so it’s not quite dead yet; but it’s nice to be able to swap in my laptop and use the big screen.)

The Link-up

On the back of the Thunderbolt dock, there are 2 Thunderbolt 3 ports. I plugged the first one into a Thunderbolt 3 to Thunderbolt 2 adapter which connects to the back of the iMac, and the second one into the Macbook directly. The Dell display plugs into the DisplayPort; I connected my network to the Ethernet port of the dock. My mouse, keyboard, and iPhone were plugged into the USB ports on the dock.

The Problem

I set it up and at first it seemed to be delivering on the “one cable” promise of thunderbolt 3. But then I switched WiFi off to test the speed of the wired network and was surprised to see that it didn’t see the dock’s ethernet port at all. Flipping wifi back on, I looked over at my router’s control panel and noticed that a new device (with the expected manufacturer) was on my network. nmap seemed to indicate that it was... running exactly the network services I expected to see on my iMac. VNCing into the iMac to see what was going on, I popped open the Network system preference pane, and right there alongside all the other devices, was the thunderbolt dock’s ethernet device.

The Punch Line

Despite the miasma of confusion surrounding USB-C and Thunderbolt 3 [2], the surprise here is that apparently Thunderbolt is Thunderbolt, and (for this device at least) Thunderbolt devices connected across the same bus can happily drive whatever they’re plugged in to. The Thunderbolt 2 to 3 adapter isn’t just a fancy way of plugging in hard drives and displays with the older connector; as far as I can tell all the functionality of the Thunderbolt interface remains intact as both “host” and “guest”. It’s like having an ethernet switch for your PCI bus.

What this meant is that when I unplugged everything and then carefully plugged in the iMac before the Macbook, it happily lit up the Dell display, and connected to all the USB devices plugged into the USB hub. When I plugged the laptop in, it happily started charging, but since it didn’t “own” the other devices, nothing else connected to it.

Conclusion

This dock works a little bit too well; when I “dock” now I have to carefully plug in the laptop first, give it a moment to grab all the devices so that it “owns” them, then plug in the iMac, then use this handy app to tell the iMac to enter Target Display mode.

On the other hand, this does also mean that I can quickly toggle between “everything is plugged in to the iMac” and “everything is plugged in to the MacBook” just by disconnecting and reconnecting a single cable, which is pretty neat.


  1. Sadly, not the most recent fancy 5K ones. 

  2. which are, simultaneously, both the same thing and not the same thing. 

by Glyph at July 18, 2017 07:11 AM

Moshe Zadka

Bash is Unmaintainable Python

(Thanks to Aahz, Roy Williams, Yarko Tymciurak, and Naomi Ceder for feedback. Any mistakes that remain are mine alone.)

In the post about building Docker applications, I had the following Python script:

import datetime, subprocess
tag = datetime.datetime.utcnow().isoformat()
tag = tag.replace(':', '-').replace('.', '-')
for ext in ['', '-slim']:
    image = "moshez/python36{}:{}".format(ext, tag)
    orig = "python:3.6{}".format(ext)
    subprocess.check_call(["docker", "pull", orig])
    subprocess.check_call(["docker", "tag", orig, image])
    subprocess.check_call(["docker", "push", image])

I showed this script to two audiences, in two versions of the talk. One, a Python beginner audience, mostly new to Docker. Another, a Docker-centric audience, with varying levels of familiarity with Python. I gave excuses for why this script is in Python, rather than the obvious choice of shell scripting for automating command-line utilities.

None of the excuses were the true reason.

Note that in a talk, things are simplified. Typical scripts in the real world would not be 10 lines or so. They start out 10 lines, of course, but then have to account for edge cases, extra use cases, random bugs in the services that need to be worked around, and so on. I am more used to writing scripts for production than writing scripts for talks.

The true reason the script is in Python is that I have started doing all my "shell" scripting in Python recently, and I am never going back. Unix shell scripting is pretty much writing in unmaintainable Python. Before making the case for that, I am going to take a step in the other direction. The script above took care to only use the standard library. If it could take advantage of third party libraries, I would have written it this way:

import datetime, subprocess
import seashore
xctr = seashore.Executor(seashore.Shell())
tag = datetime.datetime.utcnow().isoformat()
tag = tag.replace(':', '-').replace('.', '-')
for ext in ['', '-slim']:
    image = "moshez/python36{}:{}".format(ext, tag)
    orig = "python:3.6{}".format(ext)
    xctr.docker.pull(orig)
    xctr.docker.tag(orig, image)
    xctr.docker.push(image)

But what if I went the other way?

import datetime, subprocess
tag = datetime.datetime.utcnow().isoformat()
tag = tag.replace(':', '-').replace('.', '-')
for ext in ['', '-slim']:
    image = "moshez/python36{}:{}".format(ext, tag)
    orig = "python:3.6{}".format(ext)
    subprocess.check_call("docker pull " + orig, shell=True)
    subprocess.check_call("docker tag " + orig + " " + image, shell=True)
    subprocess.check_call("docker push " + image, shell=True)

Note that using shell=True is discouraged, and is generally a bad idea. We will revisit why later. If I were using Python 3.6, I could even have the last three lines be:

subprocess.check_call(f"docker pull {orig}", shell=True)
subprocess.check_call(f"docker tag {orig} {image}", shell=True)
subprocess.check_call(f"docker push {image}", shell=True)

or I could even combine them:

subprocess.check_call(f"docker pull {orig} && "
                      f"docker tag {orig} {image} && "
                      f"docker push {image}", shell=True)

What about calculating the tag?

tag = subprocess.check_output("date --utc --rfc-3339=ns | "
                              "sed -e 's/ /T/' -e 's/:/-/g' "
                                  "-e 's/\./-/g' -e 's/\+.*//'",
                              shell=True).decode().strip()

Putting it all together, we would have

import subprocess
tag = subprocess.check_output("date --utc --rfc-3339=ns | "
                              "sed -e 's/ /T/' -e 's/:/-/g' "
                                  "-e 's/\./-/g' -e 's/\+.*//'",
                              shell=True).decode().strip()
for ext in ['', '-slim']:
    image = f"moshez/python36{ext}:{tag}"
    orig = f"python:3.6{ext}"
    subprocess.check_call(f"docker pull {orig} && "
                          f"docker tag {orig} {image} && "
                          f"docker push {image}", shell=True)

None of the changes we made were strictly improvements. They mostly made the code harder to read and more fragile. But now that we have done them, it is straightforward to convert it to a shell script:

#!/bin/sh
set -e
tag=$(date --utc --rfc-3339=ns |
      sed -e 's/ /T/' -e 's/:/-/g' \
          -e 's/\./-/g' -e 's/\+.*//')
for ext in '' '-slim'
do
    image = "moshez/python36$ext:$tag"
    orig = "python:3.6$ext
    docker pull $orig
    docker tag $orig $image
    docker push $image
done

Making our script worse and worse makes a Python script into a shell script. Not just a shell script -- this is arguably idiomatic shell. It uses set -e, long options for legibility, and so on. Note that the shell does not even have a way to express a notion like shell=False. In a script without arguments, like this one, this merely means changes are dangerous. In a script with arguments, it means that handling input safely is difficult (and unlikely to happen). Indeed, this is why shell=False is the default, and recommended, approach in Python.
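
To make that concrete, here is a sketch of what happens in both styles once an argument arrives; the hostile input is invented for illustration:

import subprocess

image = "python:3.6; rm -rf ~"  # hostile input, invented for illustration

# With shell=True, the shell would parse the semicolon and run rm:
#     subprocess.check_call("docker pull " + image, shell=True)

# With the default shell=False, the whole string reaches docker as a
# single argument, and docker merely rejects the bogus image name:
try:
    subprocess.check_call(["docker", "pull", image])
except subprocess.CalledProcessError:
    print("docker rejected the argument; nothing was deleted")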

This script does little but automate Unix commands -- the primary use case of shell scripts. It stands to reason that the reverse process -- making a shell script into Python -- would have the reverse effect: making for more maintainable, less fragile code.

As an exercise of "going the other way", we will start with a simplified version of a shell script:

set -e

if [ $# != 3 ]; then
    echo "Invalid arguments: $*";
    exit 1;
fi;

PR_NUMBER="$1"; shift;
TICKET_NUMBER="$1"; shift;
BRANCH_NAME="$1"; shift;


repo="git@github.com:twisted/twisted.git";
wc="$(dirname "$(dirname "$0")")/.git";

if [ ! -d "${wc}" ]; then
  wc="$(mktemp -d -t twisted.XXXX)";

  git clone --depth 1 --progress "${repo}" "${wc}";

  cloned=true;
else
  cloned=false;
fi;

cd "${wc}";

git fetch origin "refs/pull/${PR_NUMBER}/head";
git push origin "FETCH_HEAD:refs/heads/${TICKET_NUMBER}-${BRANCH_NAME}";

if ${cloned}; then
  rm -fr "${wc}";
fi;

How would it look with Python and seashore?

import os
import shutil
import sys
import tempfile

import seashore

if len(sys.argv) != 4:
    sys.exit("Invalid arguments: " + ' '.join(sys.argv))

PR_NUMBER, TICKET_NUMBER, BRANCH_NAME = sys.argv[1:]

xctr = seashore.Executor(seashore.Shell())
repo="git@github.com:twisted/twisted.git";
wc=os.path.dirname(os.path.dirname(sys.argv[0])) + '/.git'
if not os.path.isdir(wc):
    wc = tempfile.mkdtemp(prefix='twisted')
    xctr.git.clone(repo, wc, depth=1, progress=None)
    cloned = True
else:
    cloned = False

xctr = xctr.chdir(wc)
xctr.git.fetch("origin", f"refs/pull/{PR_NUMBER}/head")
xctr.git.push("origin",
              f"FETCH_HEAD:refs/heads/{TICKET_NUMBER}-{BRANCH_NAME}")
if cloned:
    shutil.rmtree(wc)

The code is no longer than the shell version, is more explicit, and -- had we wanted to -- would be easier to refactor into unit-testable functions.

If this is, indeed, the general case, we can skip that stage entirely: write the script in Python to begin with. When it inevitably increases in scope, it will already be in a language that supports modules and unit tests.

by Moshe Zadka at July 18, 2017 05:20 AM

July 16, 2017

Itamar Turner-Trauring

Beyond fad frameworks: which programming skills are in demand?

Which programming skills should you spend your limited time and energy on? Which engineering skills are really in demand? There will always be another fad framework that will soon fade from memory; the time you spend learning it might turn out to be wasted. And job listings ask for ever-changing, not very reasonable combinations of skills: “We want 5 years experience with AngularJS, a deep knowledge of machine learning, and a passion for flyfishing!”

Which skills are really in demand, which will continue to be in demand, and which can safely be ignored? The truth is that the skills employers want are not the skills they actually need: the gap between the two can be a problem, but if you present yourself right it can also be an opportunity.

What employers want

What employers typically want is someone who will get going quickly, with as short a ramp-up time as possible and as little training as possible. While perhaps short-sighted, this certainly seems to be the default mindset. There are two failure modes:

  1. Over-focusing on implementation skills, rather than problem solving skills: “We use AngularJS, therefore we must hire someone who already knows AngularJS!” If it turns out AngularJS doesn’t work when the requirements change, hiring only for AngularJS skills will prove problematic.
  2. Hiring based on a hypothetical solution: “I hear that well-known company succeeded using microservices, so we should switch to microservices. We must hire someone who already knows microservices!” If that turns out to be the wrong solution, hiring someone to implement it will not turn out well.

What employers need

What employers actually need is someone who will identify and solve their problems. An organization’s goal is never really to use AngularJS or switch to microservices: it’s to sell a product or service, help some group of people, promote some point of view, and so on. Employers need employees who will help further these goals.

That doesn’t necessarily require knowing the employer’s existing technology stack, or having a working knowledge of trendy technologies. Someone who can quickly learn the existing codebase and technologies, identify the big-picture problems, and then come up with and implement a good solution: that is what employers really need.

This can build on a broad range of skills.

What you should do

Given this gap between what employers want and what they need, what should you do?

  1. Learn the problem solving skills that employers will always need. That means gathering requirements, coming up with efficient solutions, project management, and so on.
  2. Learn some long-lasting popular technologies in-depth. Relational databases have been around since the 1980’s and aren’t going anywhere: if you really understand how to structure data, the concurrency model, the impact of disk storage and layout, and so on, learning some other database like MongoDB will be easy (although perhaps a little horrifying). Similarly, decades after their creations languages like Python or Java are still going strong, and if you know one well you’ll have an easy time learning many other languages.
  3. Dabble, minimally, in some trendy technologies. If you occasionally spend 30 minutes going through the tutorial for the web framework of the month, when it’s time to interview you can say “I played with it a little.” This will also help you with the next item.
  4. Learn how to learn new technologies quickly.

Then, when it’s time to look for a job, ignore the list of technology requirements when applying, presuming you think you can do the job: it’s what the company wants, not what they need.

Interviewing is about marketing, about how you present yourself. So in your cover letter, and when you interview, emphasize all the ways you can address what they want in other ways, and try to point out ways in which you can help do what they actually need:

  • Getting started quickly with minimal training: “I can learn new codebases and technologies quickly, as I did at my last job when I joined the Foo team, learned how to use Bar in a month, and built Baz in just two months.”
  • Needs that are implicit in the company’s situation: “I see you’re a growing company; I have previous experience helping an engineering team grow under similar circumstances.”
  • Needs that are implicit in the job description: “I identified this big costly problem, not unlike yours, and solved it like this, using these technologies.”

Learning every new web framework isn’t necessary to get a job. Yes, technologies do change over the years: that means you need to be able to learn new technologies quickly. But just as important as technology skills are those that will make you valuable—the ability to identify and solve problems—and the skill that will make your value clear: the ability to market yourself.

July 16, 2017 04:00 AM

July 10, 2017

Itamar Turner-Trauring

Stop writing software, start solving problems

As software engineers we often suffer from an occupational hazard: we enjoy programming. Somewhere in college or high school we discovered that writing code is fun, and so we chose a profession that allowed us to both get paid and enjoy ourselves. And what could be wrong with that?

The problem is that our job as software engineers is not to write code: our job is to solve problems. And if we get distracted by the fun of programming we often do worse at solving those problems.

The siren call of programming

I’ve been coding since 1995, and I have to admit, I enjoy programming. My working life is therefore a constant struggle against the temptation to do so.

Recently, for example, I encountered a problem in the Softcover book publishing system, which converts Markdown into various e-book formats. I’d been working on The Programmer’s Guide to a Sane Workweek, and reached the point of needing to render the text into a nicely laid-out PDF.

Softcover renders Markdown blockquotes like these:

> This is my story.

into LaTeX quote environments like this one:

\begin{quote}
This is my story.
\end{quote}

I wanted the output to be a custom LaTeX environment, so I could customize the PDF output to look a particular way:

\begin{mycustomquote}
This is my story.
\end{mycustomquote}

This is the point where programming began calling out to me: “Write code! Contribute to the open source community! Submit a patch upstream!” I would need to:

  1. Learn the Softcover code base just enough to find the relevant transformation.
  2. Learn just enough more Ruby to modify the code.
  3. Figure out how to make the output customizable, write a test or three, and then submit a patch.

This probably would have taken me an afternoon. It would have been fun, and I would have felt good about myself.

But my goal is not to write software: my goal is to solve problems, and the problem in this case is spitting out the correct LaTeX so I can control my book’s formatting. And so instead of spending an afternoon on it, I spent five minutes writing the following Makefile:

build-pdf:
	rm -rf generated_polytex/*.tex
	softcover build:pdf
	sed 's/{quote}/{mycustomquote}/g' -i generated_polytex/*.tex
	softcover build:pdf

This is a horrible hack: I’m relying on the fact that building a PDF generates TeX files if they don’t already exist, but uses existing ones if they are there and newer than the source. So I build the PDF, modify the generated TeX files in place, and then rebuild the PDF with the modified files.

I would never do anything like this if I were building a production system used by customers. But this isn’t a production system, and there are no customers: it’s a script only I will ever run, and I run it manually. It’s not elegant, but then it doesn’t have to be.

I solved my problem, and I solved it efficiently.

Stop writing code, start solving problems

Don’t write code just because it’s fun—instead, solve the problem the right way:

  • Sometimes that means writing no code at all, because you can solve the problem with some Post-It notes on the wall.
  • Sometimes that means writing boring tests, even though it’s no fun to write tests.
  • Sometimes that means reusing someone else’s library, even though it’s much more fun to write your own version.

You can write software for fun, of course: programming makes a fine hobby. But when you’re working, when you’re trying to get a product shipped, when you’re trying to get a bug fixed: be a professional, and focus on solving the problem.

PS: Coding for fun when I should’ve been solving problems is just one of the many programming mistakes I’ve made over the years. Sign up for my Software Clown newsletter and every week you’ll hear the story of one of my engineering or career mistakes and how you can avoid it.

July 10, 2017 04:00 AM