Escape the scripter mentality if reliability matters (2011)

rkachowski · on July 27, 2019

I would always reach for scripting every time. Getting a working implementation and experiencing real world data + problems as soon as possible is always best if you are taking an iterative approach.

Unless there are some serious domain experts on hand, I feel mapping the problem space is the most valuable approach - and one of the best ways to do this is to attempt a fast and basic solution. The alternative is a big design up front with minimal experience in the subject domain.

It feels like the inverse of "I'm going to scale my foot up your ass"[1] - how useful are your reliability and metrics if your service is never used / becomes irrelevant / newspaper business dies / company folds / huge paradigm shift in tech.

I feel the dirty hack is a valuable step in the process, and it's not possible to jump to the end without mapping the route first.

[1] http://widgetsandshit.com/teddziuba/2008/04/im-going-to-scal...

lumost · on July 27, 2019

From experience most of the time when an engineer can't write down a design for what they'll be delivering over the next 6-12 weeks, they won't be able to deliver the desired outcome in shorter iteration cycles either.

pmontra · on July 27, 2019

6 to 12 weeks? I used to have 6 to 12 months and a team that delivered software and designs. Change of job, my customers ask me to deliver features in the next 6 or 12 days now, and I work less than 50% of the time for any of them. Design is made on paper or only in my mind, documented in markdown (when I have plenty of time) or comments on YouTrack or kanbanize (customer choice), software written in Elixir, Python, Ruby, some JS. Sometimes I complain that they want to rush sw into production without letting it settle down in staging, then lose time fixing unexpected cases.

TheOtherHobbes · on July 27, 2019

It depends on the domain. CRUD should be by the numbers, unless you're working at planetary scale.

But some domains are very poorly understood. You won't get far by trying to iterate - because you don't even know what the problem is, never mind how to solve it, and your dirty hack is going to introduce naive assumptions that turn out to be spectacularly wrong.

smcameron · on July 27, 2019

Though as John Gall famously said, "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system."

greggyb · on July 27, 2019

Coming from a perspective of the IT consulting world, CRUD apps are rarely that simple.

The core is always the same, sure - you're basically wrapping read and write transactions to a storage subsystem in a pretty veneer. The complexity is never in the CRUD actions themselves. But transforming the things that have been R-ed so we can U them or identifying the full population of things to be D-ed is often non-trivial and deeply tied into understanding of the business domain and process at hand.

Often (analytics and reporting lenses on here), the complexity is in modeling the data appropriately and the logic of transformation. The work of presenting this stuff to a user is basically trivial arithmetic.

kd5bjo · on July 27, 2019

In this sort of situation, it's always useful to ask what the cost of not automating the task at all is. That will give everyone a better feel for whether the automation makes sense, and how much developer time should be allocated to the project.

> Let's say I want you to get the fifth word of the fourth paragraph of the third column of the second page of the first edition of the local paper in a town to be determined. It's going to be a color, and we have a little deal with that paper to get our data plugged in every day. You can get the feed from their web site.

For the one-off case, I'd just go open up the paper and look for myself. Surely that'll be faster than trying to write some sort of a script to do it.

> Now I want you to be able to do this reliably every day for the next two years. I need this data on a regular delivery schedule and it can't rely on some human being there to constantly fine-tune things.

Ok. So now we're automating a 5 minute job that can be done by the office assistant, which will be performed 730 times. That puts the upper bound of time saved at around 120 hours of work. Due to their specialized training, software engineers probably cost the company more than 3x per hour than the assistant, so this automation task only makes sense if it can be completed in less than a week of work for the engineer, including maintenance over the next two years.

That sounds like a reasonable, but tight, time budget for the given task and reliability specifications. It's not an obvious win, which also means the ROI will be small. Also, there's opportunity cost to consider: that week of development time now can't be spent on other projects that may be more urgent.

andrewflnr · on July 27, 2019

> ... if reliability matters

Your analysis ignores the cost of failure to get the right word, or maybe just assumes the office assistant is infallible. That cost is almost certainly non-zero, and might be much higher than the whole development cost.

kd5bjo · on July 27, 2019

The task as described included an easy for a human but tricky for a computer verification that the correct word will always be a color. Given this, and the need for the system to be tolerant against human error occurring at the paper, I believe a typical untrained human will have a better reliability on this task than almost any automated system.

But in the general case you’re absolutely right: any analysis like this should include the costs and probability of faults occurring in each option.
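
A minimal sketch of what that "is it a color?" check could look like in an automated version. The paragraph separator (blank lines), the `KNOWN_COLORS` set and the function name are all assumptions made for illustration; nothing here comes from the article or the paper's real feed.

```python
# Hypothetical helper: assumes the column arrives as plain text with
# paragraphs separated by blank lines.
KNOWN_COLORS = {
    "red", "orange", "yellow", "green", "blue", "purple",
    "black", "white", "brown", "pink", "gray",
}


def fifth_word_of_fourth_paragraph(column_text: str) -> str:
    """Return the fifth word of the fourth paragraph, or raise if it is not a color."""
    paragraphs = [p for p in column_text.split("\n\n") if p.strip()]
    if len(paragraphs) < 4:
        raise ValueError("column has fewer than four paragraphs")
    words = paragraphs[3].split()
    if len(words) < 5:
        raise ValueError("fourth paragraph has fewer than five words")
    # Drop surrounding punctuation and normalize case before validating.
    word = words[4].strip(".,;:!?\"'").lower()
    if word not in KNOWN_COLORS:
        raise ValueError(f"{word!r} is not a known color")
    return word
```

A fixed color list is exactly where a human would likely do better: a person recognizes "teal" or a typo like "bleu" without being told about them in advance.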

andrewflnr · on July 27, 2019

Sure, but maybe the assistant gets distracted by other work, or gets sick and doesn't get a chance to make sure someone else does it, or quits because their boss is a jerk, or any number of other things going wrong. At least a few of these will happen over the course of two years.

kd5bjo · on July 27, 2019

Note that I got my analysis above slightly wrong: 730 times 5 minutes is 60.8 hours, not 120. According to the new numbers, the break even is 20 hours of engineering work (or adjust the engineer/assistant multiplier; I picked 3x to make the math work out cleanly)
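
The corrected arithmetic, as a small sketch. The function and parameter names are made up for illustration; the inputs (730 runs, 5 minutes each, a 3x cost multiplier) are the ones used above.

```python
def break_even_hours(runs: int, minutes_per_run: float, cost_multiplier: float):
    """Return (assistant hours saved, engineer hours at which automation breaks even)."""
    assistant_hours = runs * minutes_per_run / 60
    # An engineer hour costs `cost_multiplier` assistant hours, so the budget shrinks by that factor.
    engineer_hours = assistant_hours / cost_multiplier
    return assistant_hours, engineer_hours


assistant_hours, engineer_hours = break_even_hours(730, 5, 3)
print(f"assistant time saved: {assistant_hours:.1f} h")   # 60.8 h
print(f"engineer break-even:  {engineer_hours:.1f} h")    # 20.3 h
```

This deliberately leaves out the cost of failures and of maintenance, which the replies above argue should be part of the comparison.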

brachi · on July 27, 2019

> So now we're automating a 5 minute job that can be done by the office assistant, which will be performed 730 times.

Mandatory xkcd reference: https://xkcd.com/1205/

crispyambulance · on July 27, 2019

The first thing that works is just fine, for a while, until it's not.

Then, you need to evolve to something else that better fits the job at hand. This is not a technical problem or even one of "mentality", I think.

It's more of an organizational challenge. It means acquiring the level of agency needed to adapt or evolve solutions to problems.

If you need to dig a trench, once, in your backyard, sure a pick axe and shovel and some hard-labor is just fine. If you need to dig a trench every day... you need a backhoe. Sadly so many organizations choose to do the equivalent of operating "chain-gangs" to dig trenches with pick-axes and shovels, at scale, every day. The people in charge of these chain-gangs just don't know any better and the people digging don't have the agency to demand the right tools and right approach.

smacktoward · on July 27, 2019

> The people in charge of these chain-gangs just don't know any better

This is a dangerous assumption. It seems at least equally likely that they are just responding to the incentives the system presents them. In other words, when you see a chain gang, you're seeing an organization that views hardware as expensive and people as cheap.

derefr · on July 27, 2019

I think, in this case, the “chain gang” might not be a gang of people, but rather a gang of e.g. Unix scripts in Docker containers scheduled onto a K8s cluster via SQS from S3 lifecycle events. (I.e. the original software written to solve the 1x task, now chain-ganged together to solve an Nx task.)

In cases like this, where both the solutions are software, the reason the company is relying on a progressively Matryoshka'ed system built on top of the original, "dumb" solution, is usually that they value developer time (to build a better solution) greatly over ops-staff time (to implement the infrastructure required to scale the dumb solution) + additional hardware costs (for all the overhead the dumb solution's method of scaling introduces.)

But even that doesn't explain why organizations refuse to switch from bailing-wire scripts to a pre-built, commonly-available infrastructure-component better solution. In such cases, both staying and switching are ops costs.

kd5bjo · on July 27, 2019

> But even that doesn't explain why organizations refuse to switch from bailing-wire scripts to a pre-built, commonly-available infrastructure-component better solution. In such cases, both staying and switching are ops costs.

The costs of staying are known, because they’re the ones you’ve already been paying for a while. Switching to a new system is inherently risky as there’s always a significant chance you haven’t correctly identified all of the requirements from the existing system.

dkersten · on July 27, 2019

I don’t know. It's easier (i.e. cheaper) to write a quick script and fix it when it breaks than it is to try and anticipate what unknown unknowns might crop up over the next two years. Absolutely, think through the edge cases, but there’s little point in trying to anticipate things you really don’t know about and if it was a quick script to write, it’ll be quick to write a new one that works in the new situation (and if not, well, you have more information now to implement a better solution than you did at the start).

It depends on how critical the task is. Can you detect an error occurred that you now need to fix? Can you tolerate the lag between error and fix? etc

haddr · on July 27, 2019

Scripting is great, especially for handling one-time jobs, for doing PoCs or short-lived projects. You can be really productive and just get the job done. But operating something on a regular basis is a different thing. Productionizing such a thing involves a lot of different aspects to handle, and this article is mainly about that: showing cases where evolving “scripting solutions” simply doesn’t scale. Too much glue and too little control over the main aspects of the problem.

I saw solutions that started like this and very quickly became an unmanageable mess. I guess it’s some sort of pattern where it’s hard to see the fine line between “it works so we don’t need to reengineer it” and realising that a huge technical debt has been incurred. It requires an experienced person to make the decision to drop the former and start with some more sane solution.

wodenokoto · on July 27, 2019

The story starts with inconsistently formatted stream of text data and ends with nicely formatted binary data.

Where does she get this nice data, where integers are integers and fields have nice names?

jt2190 · on July 27, 2019

Yeah, I think that the newspaper example obscured her larger point that converting machine-readable data into human-readable data and then using text parsing is both error-prone and unnecessary compared to just using tools that work directly with the machine-readable data.

Edit: The meat of the article:

> This gets into a whole thing I call "scripter mentality". It seems like some people would rather call (say) tcpdump and parse the results instead of writing their own little program which uses libpcap. Calling tcpdump means you have to do the pipe, fork, dup2, exec, parse thing. Using libpcap means you just have to deal with a stream of data arriving that you'd have to chew on anyway.

Edit 2: “Scripter” here refers to operating on text files. The Rule of Composition [1] encourages the use of text processing, for example:

> Text streams are to Unix tools as messages are to objects in an object-oriented setting. The simplicity of the text-stream interface enforces the encapsulation of the tools. More elaborate forms of inter-process communication, such as remote procedure calls, show a tendency to involve programs with each others' internals too much.

[1] The Art of Unix Programming: http://www.catb.org/~esr/writings/taoup/html/ch01s06.html#id...

kd5bjo · on July 27, 2019

Also, how does reading the data from the original source solve the business need of verifying that the newspaper included the information in the agreed-upon place in the print edition?

lilyball · on July 27, 2019

The binary data at the end is a different story than the text data at the beginning.

XCSme · on July 27, 2019

I had to manually download all the invoices from a marketplace each month as they had no "download all" button. I created a script to (download list of invoice names, get the invoice PDF based on name, save PDF in folder). All this was done by hardcoding paths and assuming stuff doesn't change. I implemented this script in a few hours and have been using it for over two years without any issues (doing each month in 2 minutes what previously took 1 hour). What's the problem with that? Yes, sometimes I had small issues (e.g. I was being rate limited trying to download PDFs, but quickly added a request delay to fix that issue and it worked again). I think it's a lot faster/more productive to implement what's quickest and adapt along the way, than spend a lot of time implementing the "perfect solution" which probably has the same chances of failing as the quick-and-dirty script.
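
Roughly the shape of such a script, as a sketch only. The actual script is not shown in this thread, so `download_invoices`, the injected `fetch_pdf` callable and the default one-second delay are all assumptions; the real marketplace endpoints are deliberately left out.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def download_invoices(
    names: Iterable[str],
    fetch_pdf: Callable[[str], bytes],   # hypothetical: returns the PDF bytes for one invoice name
    out_dir: Path,
    delay_seconds: float = 1.0,          # assumed value; tune to whatever the site tolerates
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Save each invoice as <name>.pdf, pausing between requests to avoid rate limiting."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, name in enumerate(names):
        if index > 0:
            sleep(delay_seconds)  # the "request delay" fix: wait before every request after the first
        target = out_dir / f"{name}.pdf"
        target.write_bytes(fetch_pdf(name))
        saved.append(target)
    return saved
```

Passing `fetch_pdf` and `sleep` in as arguments keeps the sketch runnable without network access; a quick script would more likely hardcode both.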

ozten · on July 27, 2019

This sounds perfect. I think her concern is if your script became the basis for a new product. Now your company depends on it running more often and no one is manually reviewing the output.

XCSme · on July 27, 2019

The final point in my comment was that even if you spend a lot of time trying to make a more reliable product instead of a quick script, you still can't be sure it will never break in the future as the data source can become invalid at any point. Imagine a news website, they update their platform and suddenly all 404 pages actually redirect to first article/post in database instead of correctly returning 404, so your product would crawl the same data each day and say everything is fine, even though it's not. There are infinite such problems that can arise, so even if you have a well-thought product, it still needs some human sanity-checks and updates once in a while.
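
One cheap sanity check for that particular failure, sketched under assumptions: compare a digest of today's payload with the previous run's and refuse to report success when they are identical. `check_fresh` and its arguments are hypothetical names, and this only catches byte-identical content (a page with an embedded timestamp would slip through).

```python
import hashlib
from typing import Optional


def content_digest(payload: bytes) -> str:
    """SHA-256 hex digest of the raw crawled payload."""
    return hashlib.sha256(payload).hexdigest()


def check_fresh(payload: bytes, previous_digest: Optional[str]) -> str:
    """Return today's digest, or raise if the payload is identical to the previous run's."""
    digest = content_digest(payload)
    if previous_digest is not None and digest == previous_digest:
        raise RuntimeError("payload is byte-identical to the previous run; needs a human look")
    return digest
```

The caller would store the returned digest and pass it back in on the next run.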

danjc · on July 27, 2019

This week I met with a potential client for our integration platform (iPaaS). Their business centers around interpreting financial data and that means ingesting raw data from different sources.

The guy who handles this has set up a SQL database and pulls data in via FTP and email. Outlook sits open on a server so that Outlook message rules can be used to fire up a VBScript which in turn calls a stored procedure to ingest the data. There are a bunch of combinations of these to deal with different feeds.

He's a brilliant guy and he's used the tools he knows (not a dev) but it's brittle and only he knows how to support it. It's unlikely they'll move to our platform (invasion of his turf) but I do hope he finds another solution that's more reliable.

statictype · on July 28, 2019

Sounds like an iPaaS is exactly what they need, and since he's not a dev he may even embrace it?

What’s your iPaaS product?

danjc · on July 28, 2019

The feedback I had indirectly is that he feels that replacing his work with a platform would negate the value of what he's built so far. We're flowgear.net

m0nty · on July 27, 2019

> You're going to have to invest the time to do it right up front

This is the problem with most places I have worked: "I want this and I need it today." Scripting is the obvious answer.

Then it will change to "I want it to run every day" and even if I say "it needs more work" the response will be "well, it's working now, so why do you need to fix it?" And so a shell script or several go into production and I can't stop it.

In any case, even if you use something more formal, you can still write fragile, error-prone code. Bash or Perl is not the problem - you can do great work with those tools, if you have time.

quickthrower2 · on July 27, 2019

Any boss can conjure up unrealistic demands. If it’s not possible to have a reasonable dialogue about it from the beginning, and that communication issue can’t be fixed maybe look for another job?

twic · on July 27, 2019

> Now I want you to be able to do this reliably every day for the next two years. I need this data on a regular delivery schedule and it can't rely on some human being there to constantly fine-tune things.

If you're going to stack the deck like that, then sure, scripting isn't the right approach here.

But in reality, for many things, automation that works 90% of the time and reliably attracts manual intervention the rest of the time is often fine. If you can build it quickly, it wins out over something more robust that takes significant investment to build.

adrianN · on July 27, 2019

It can actually be better than automation that works flawlessly for two years and then breaks or needs an upgrade. While nobody was looking at it, knowledge of its operation got lost.

oweiler · on July 27, 2019

I only use Bash for the simplest tasks. For anything more elaborate I use Groovy scripts. At some point I switch to full Groovy projects, switching on static compilation, adding unit and integration tests and so forth.

vorg · on July 27, 2019

> At some point I switch to full Groovy projects, switching on static compilation

If you let your Apache Groovy scripts get too large before switching on static compilation, you often have to modify the types and logic in your programs before they'll even compile. This problem occurs because static compilation was only bolted onto Groovy for version 2.0 and there's an impedance mismatch between what's required for its dynamic and static modes.

badrabbit · on July 27, 2019

I use python scripts these days but I don't think they're any better than bash or perl scripts. Just more popular these days.

zmmmmm · on July 27, 2019

The problem with python is even if you do it well, you don't get very far above a script. As oweiler above mentions, this is where languages with a bit more comprehensive support for true incremental typing, structuring etc. work better. It's definitely one of the things I like about Groovy - that it's both a better first line scripting language AND cuts it as a first class structured application development language on a par with Java etc - and you can do incremental shades of gray all the way in between. Of course, you have to actually DO it which is where the real trap is. But it's definitely a level above Python where the friction of transitioning from "this is fine as an ad hoc script" to "This really ought to be a proper library / module in a statically typed language" is high enough that it will basically never happen.

dijksterhuis · on July 27, 2019

The transition you mention for Python is actually not that high.

Error handling, edge cases, unit tests, type checking, packaging can all be performed in Python.

Type checks are the biggest bug bear. Otherwise, it’s relatively easy to get things clean and tidy (packaged).

xiphias2 · on July 27, 2019

I'm using Julia the same way for math computation: starting as a scripting language, but creating new types and adding typing information on the fly (start with a tuple, then create a struct if that tuple is useful multiple times).

chomp · on July 27, 2019

>This gets into a whole thing I call "scripter mentality". It seems like some people would rather call (say) tcpdump and parse the results instead of writing their own little program which uses libpcap.

This made me smile, because it made me remember a suite called “DSC” that I could imagine inspired this post. It did work, but seemed like a kludge. Last time I used it was when this post was written actually, and looks like it does use libpcap now.

https://www.dns-oarc.net/tools/dsc

coldtea · on July 27, 2019

>This gets into a whole thing I call "scripter mentality". It seems like some people would rather call (say) tcpdump and parse the results instead of writing their own little program which uses libpcap.

And those people are right. Too many systems have too many unnecessary layers and "just in case" code, which instead of making them more robust than a small script, serves to slow them down, and increase the possible failure modes exponentially...

tyingq · on July 27, 2019

Personally, I don't like tying together the idea of "scripting" and being unreliable.

I've done lots of shell and Perl, and you can make them as reliable and bulletproof as any other option.

Do lots of scripts ignore return codes, lack retry logic, sanity checks, etc? Sure, but so do lots of applications written in compiled languages.
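
For illustration, a sketch of what checking return codes plus retry logic can look like when a script shells out. `run_checked` and its defaults are made-up names and values; the same idea applies in shell or Perl.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_checked(cmd: Sequence[str], attempts: int = 3, delay_seconds: float = 1.0) -> str:
    """Run cmd and return its stdout; retry on a non-zero exit, re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            # check=True turns a non-zero exit status into CalledProcessError
            # instead of letting it pass silently.
            result = subprocess.run(cmd, check=True, capture_output=True, text=True)
            return result.stdout
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

For example, `run_checked([sys.executable, "-c", "print('ok')"])` returns the child's output, while a command that keeps exiting non-zero raises after the final attempt rather than being ignored.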

"Scripting" isn't really the issue.

egdod · on July 27, 2019

For a task like the newspaper one, the cost of getting every edge case perfect probably isn’t justified. Write the simple version, and wrap the whole thing in a try/catch. When it falls over, have it email the receptionist and tell him to look at the paper.
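
Something like this, as a sketch: `task` is the simple version of the lookup and `notify` is whatever sends the email (for instance a thin wrapper around `smtplib`). Both are passed in as assumptions rather than spelled out.

```python
from typing import Callable, Optional


def run_with_human_fallback(
    task: Callable[[], str],          # hypothetical: the simple, unhardened lookup
    notify: Callable[[str], None],    # hypothetical: emails the person who will check by hand
) -> Optional[str]:
    """Run the simple version; on any failure, ask a human to look instead of guessing."""
    try:
        return task()
    except Exception as exc:
        notify(f"Automated lookup failed ({exc!r}); please check the paper by hand.")
        return None
```

Note that this only covers failures that raise. A run that quietly returns the wrong word still needs a check of its own.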

smitty1e · on July 27, 2019

The battle is tactics vs. strategy.

If you really have finite, bounded requirements, then treat it like a full project with front-end design, granular types, kit & kaboodle.

So often these shiny objects are more moving targets.

contingencies · on July 27, 2019

Opinion + metaphorical problem = this article. The key issue with the proposed metaphorical problem is the instability of its input. Garbage in, garbage out. Nobody can predict the future (but we can probabilistically interpret it if necessary!). To my mind, the article's conclusions take this unadmitted property of the metaphorical problem and use it to baselessly question an entire category of solutions.

lmilcin · on July 27, 2019

I don't exactly agree with this. Solutions like that can be optimal under certain assumptions.

In my mind, there are two ways to write software.

One way, represented by traditional product shipped to customers (think Microsoft Excel), is writing software in such a way you can show it works correctly under all possible circumstances. You need to write every piece of this software to be resilient to changes in environment and always do the right thing no matter what. Writing software like that takes care and is very expensive.

Another way, represented by typical business application written for use by the same company, is writing software to work in only very narrow possible set of circumstances. You demonstrate that it works and you ship it. Do you need it to work on all possible OS-es? Hell, no, you have an image and it works on it, it is enough. You insulate yourself from outside interference as much as possible, docker images, tightly controlled network environment, dependencies, only single supported integration architecture, etc.

The second way is the way to go when building "enterprise" software for internal consumption. You need to understand what you are doing. You need to understand why you are doing it. If you are doing it correctly it will always be cheaper and better than investing in a "perfect" solution.

donatj · on July 27, 2019

I find well constructed shell scripts oft more reliable than large tools built for a specific purpose. I think this koan applies.

http://www.catb.org/~esr/writings/unix-koans/ten-thousand.ht...

janpot · on July 27, 2019

When I compose small "do one thing well" shell scripts together, people call it "the unix philosophy". When I compose small "do one thing well" modules together in node.js, people call it "a dependency hell".

kungtotte · on July 27, 2019

Every distro and BSD everywhere comes with e.g. find, tr, grep, and so on so all you do is write the glue in a language that also ships with the OS (and is portable across all of them if you keep it POSIX).

How do I run your node.js thing on my system without installing dozens or even hundreds of packages and modules?

It's not the same thing.

bitwize · on July 27, 2019

Yes, but... with the appropriate crates you can write a 50-line Rust version of the same program that runs, for all intents and purposes, as fast as the C version, and more correctly and safely, while requiring little more time than the shell version to test and debug.

z3t4 · on July 27, 2019

Some programmers will say the pipes are beauty and the reliable code is ugly. But the real world is ugly!

patsplat · on July 27, 2019

Investing time upfront is only productive if one has ready access to a broad and relevant corpus of data.

walshemj · on July 27, 2019

I think assuming that some local newspaper is going to have the same layout in two years is a big ask here.

The actual way to do this would be scripting using beautiful soup and similar tools.