The Beatles in The Lord of the Rings

Many years ago, when I was studying Artificial Intelligence at Birmingham University, I read about a project that the Beatles had proposed but never completed. They wanted to make a film version of The Lord of the Rings, with Paul as Frodo, Ringo as Samwise, George as Gandalf and John as Gollum. What is more, they wanted Stanley Kubrick (John was a huge fan of 2001) to direct the movie.

Stanley Kubrick wasn’t interested, the Beatles split up, and the project never got off the ground.

At the time, I was wondering a lot about the question “Could a machine ever think?” Different philosophers put forward different answers. Some dismissed the question as no more meaningful than asking “Can submarines swim?” Others, notably proponents of the Turing Test, sought to answer it by looking at the observable performance of the system. A milestone that I thought about was an AI system that could produce the film described above from nothing more than prompting. It seemed like science fiction fantasy at the time.

Twenty years later, AI has come a very long way. In just a few minutes this morning, working from my phone, I was able to create some images purely from written prompts using Stability AI’s Stable Diffusion model, hosted as a Hugging Face Space:

https://huggingface.co/spaces/stabilityai/stable-diffusion

John Lennon as Gollum
Paul McCartney as Frodo Baggins
Ringo Starr as Samwise Gamgee
George Harrison as Gandalf

It’s not an epic in the style of Stanley Kubrick. Nor has it generated a Beatles-esque soundtrack. However, the power of these AI systems is astounding.

What I learned rewriting old Perl scripts in Go

As part of my project to learn the Go programming language, I decided to rewrite some of my old Perl scripts in the newer language to see what would be better, what would be worse and what would stay the same.

The scripts are not very glamorous, but they are useful for my setup. One generates scripts that wrap rsync in order to keep two directories in synch. The other (which is closely related) keeps track of which files have been marked for deletion and keeps deleting them if they reappear. Once one starts to copy files back and forth between directories, deleting files becomes a little complicated, as the file you delete will keep reappearing unless you delete it everywhere. I wrote the scripts before Dropbox and the like had come on the scene. I keep them around as I don’t like the only copy of my photos and videos to be in a folder managed by someone else. They are also useful for folders that have too much content to be stored practically in Dropbox, like virtual machine files, and for synching folders on servers via SSH.

I have been meaning to update the scripts and change their behaviour for a while, so a rewrite is a good time to ditch some obsolete features. The synch script generator used to be able to generate batch files to run on Windows. Now, it just creates bash scripts. This is fine for me as, on Windows, I use Cygwin for rsync, so bash is not a problem as a dependency.

Another change is that the script generator used to silently skip generating a script if a file already existed at its destination. The thinking was that the generated scripts could be considered scaffolding for more capability, to be added by amending or extending them, and the generator should not overwrite those changes. Experience, here and elsewhere, has taught me that this is a nightmare to work with. Generated files should be regeneratable at any time. In light of this, the generated scripts are now marked with a warning comment at the top saying that they have been autogenerated, and they are blasted out of the way whenever the program is run again, potentially as part of a scheduled task.
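
To make that policy concrete, here is a minimal sketch of the idea (the paths, the rsync flags and the helper function are all made up for illustration; the real generator linked below does more). The generated script carries the warning header and is simply overwritten on every run. In this sketch, rsync is run without --delete, which is exactly why deleted files keep reappearing and a separate tool is needed to keep them deleted:

package main

import (
    "fmt"
    "os"
)

// writeSynchScript is a hypothetical helper: it (re)generates a bash
// wrapper around rsync, overwriting any previous version, because
// generated files should be regeneratable at any time.
func writeSynchScript(path, source, destination string) error {
    script := fmt.Sprintf(`#!/bin/bash
# AUTOGENERATED - DO NOT EDIT.
# This file is overwritten every time the generator runs.
rsync -av %s/ %s/
`, source, destination)
    return os.WriteFile(path, []byte(script), 0o755)
}

func main() {
    // Example paths for illustration only.
    err := writeSynchScript("synch-photos.sh", "/home/robert/Photos", "/mnt/backup/Photos")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}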

The script generation program is at:

https://github.com/robert-impey/generate-synch-scripts

The second program is for making sure that files stay deleted when two folders are synched, as described above. This had been at:

https://github.com/robert-impey/stay-deleted

However, as the project grew and I needed to move code into more than one file, I learned that package names in Go should not contain punctuation, so I started a fresh project and moved my code there:

https://github.com/robert-impey/staydeleted

My new version of the program is working correctly. I’ve also added a few new features like deleting old metadata files and an option to run repeatedly at random times, which is useful for running as a scheduled task.
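
As a rough illustration of the core idea (the metadata file name, its format and the helper function are all made up for this sketch; see the repository for the real implementation), the program only needs to read a list of paths that should stay deleted and remove any that have reappeared:

package main

import (
    "fmt"
    "os"
    "strings"
)

// keepDeleted is a hypothetical helper: it reads a metadata file listing
// one path per line and deletes any listed file that has reappeared
// after a synch.
func keepDeleted(metadataPath string) error {
    data, err := os.ReadFile(metadataPath)
    if err != nil {
        return err
    }
    for _, line := range strings.Split(string(data), "\n") {
        path := strings.TrimSpace(line)
        if path == "" {
            continue
        }
        if _, err := os.Stat(path); err == nil {
            if err := os.Remove(path); err != nil {
                fmt.Fprintln(os.Stderr, err)
            }
        }
    }
    return nil
}

func main() {
    // The file name ".stay-deleted" is invented for this example.
    if err := keepDeleted(".stay-deleted"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}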

The most immediate advantage that I have found for this script is the ease of distributing a binary generated by the Go compiler. The nature of the software requires that the program runs on every computer in the system, which for my setup includes Windows, macOS and Linux. Getting a Perl script with a few library dependencies to run on all those computers was a pain. With Go, it’s been trivial so far.
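
A large part of that is Go’s cross-compilation support: binaries for each operating system can be built on one machine just by setting a couple of environment variables (the output names here are only examples):

GOOS=linux   GOARCH=amd64 go build -o staydeleted-linux
GOOS=darwin  GOARCH=amd64 go build -o staydeleted-macos
GOOS=windows GOARCH=amd64 go build -o staydeleted.exe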

In order to have a more modern command-line interface, my program now uses spf13’s Cobra library:

https://github.com/spf13/cobra

This has forced me to reorganize my code to some extent, which has been a good thing. However, I’m still very much at the beginning of my Go learning journey. While the program works, the code is a bit of a mess. I’m not sure how to apply the clean code principles (such as SOLID) that I use in my day job writing C# to Go. Perhaps this is one of the main advantages of aiming to be a polyglot developer. Everyone should aim to write readable code in every language, and we discuss and come to agreements about design principles for using a language’s features to make code more readable. When you move to a new language, you might not know how that language implements those features, or the features might not exist at all. What happens then to those design principles? Can they be reformulated for the new language?

Stress Testing Algorithmic Code

I keep a “To Do” list stored in a text file. I add sub-tasks to tasks by putting them on the line below the task at a greater depth of indentation. This works for me as I find that whenever I look at any task, I can always think of ways to break it down further. Also, I get to see everything in one place.

A few years ago, I decided to write a tool to sort my “To Do” list:

https://github.com/robert-impey/tree-sorter

The complex part was parsing tree structures of arbitrary depth and sorting each tree and subtree independently.

For example,

II Animals
  2 Dogs
  3 Cats
  1 Fish
I Plants
  b Nettles
  c Oak
  a Grass

becomes

I Plants
  a Grass
  b Nettles
  c Oak
II Animals
  1 Fish
  2 Dogs
  3 Cats

This works fine for my tasks. My list is saved in a file and a scheduled task sorts it periodically.
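
To make the parsing and sorting concrete, here is a minimal sketch of the idea in Go (for illustration only; the real tool is written in Python and linked below, and this sketch assumes well-formed input indented with two spaces per level):

package main

import (
    "fmt"
    "sort"
    "strings"
)

type node struct {
    text     string
    children []*node
}

// linesToTree builds a forest from space-indented lines.
// It assumes each line is at most one level deeper than the previous one.
func linesToTree(lines []string, indentWidth int) []*node {
    root := &node{}
    stack := []*node{root} // stack[d+1] is the most recent node at depth d
    for _, line := range lines {
        trimmed := strings.TrimLeft(line, " ")
        depth := (len(line) - len(trimmed)) / indentWidth
        n := &node{text: trimmed}
        stack = stack[:depth+1]
        parent := stack[depth]
        parent.children = append(parent.children, n)
        stack = append(stack, n)
    }
    return root.children
}

// sortTree sorts each node's children recursively,
// so every tree and subtree is ordered independently.
func sortTree(nodes []*node) {
    sort.Slice(nodes, func(i, j int) bool { return nodes[i].text < nodes[j].text })
    for _, n := range nodes {
        sortTree(n.children)
    }
}

// writeLines re-emits the sorted forest as indented lines.
func writeLines(nodes []*node, depth, indentWidth int) {
    for _, n := range nodes {
        fmt.Printf("%s%s\n", strings.Repeat(" ", depth*indentWidth), n.text)
        writeLines(n.children, depth+1, indentWidth)
    }
}

func main() {
    lines := []string{
        "II Animals",
        "  2 Dogs",
        "  3 Cats",
        "  1 Fish",
        "I Plants",
        "  b Nettles",
        "  c Oak",
        "  a Grass",
    }
    forest := linesToTree(lines, 2)
    sortTree(forest)
    writeLines(forest, 0, 2)
}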

However, one annoyance was that if the file was already in order, the program would sort it and write it out anyway. Another problem was that I was never very pleased with the algorithm that I wrote. Although the performance was acceptable for the 300-odd lines of my file, I suspected that I could do better. I already had a range of unit tests and sample files as fixtures in place, so I could just start by trying to improve the data structures and algorithm at the heart of the program. However, I wanted to be really sure that I didn’t introduce a regression for a particularly tricky tree that I hadn’t thought of.

Fortunately, the input and the output of the algorithm are easy to describe. The input is any tree structure of space-indented lines in a text file; the output is those lines in order. I decided to add a stress test to the program. The test randomly generates trees of a given size and depth. The sorting code then attempts to put them into order. If the result is not in order, the program halts and prints out the offending input and output. Otherwise, it keeps looping round (generating, sorting and checking trees) until I kill the process.

while True:
    random_tree_lines = generate_random_tree_lines(
        args.Depth,
        args.Items,
        args.Length,
        args.Alphabet)

    tree = lines_to_tree(random_tree_lines)

    lines = tree.get_lines()

    if not are_lines_sorted_tree(lines):
        print('A tree was not sorted correctly')

        print('Input')
        for line in random_tree_lines:
            print(line)

        print('Output')
        for line in lines:
            print(line)

        break

The complete listing is at:

https://github.com/robert-impey/tree-sorter/blob/master/stresstest.py

Such a stress test is intended to be a supplement to unit tests. If a tree is not sorted correctly, it should be simplified and added to the unit test fixtures. Then, once the code has been fixed to deal with the tricky input, we know that we can avoid regressions. Generally, random generators are less forgiving than developers when it comes to coming up with tricky inputs.

This stress test will keep running forever if left uninterrupted. Typically, if there is an error anywhere in a system of this level of complexity, it will be found within seconds. The stress test didn’t find any error conditions in the existing code. However, it did take a few iterations to iron out a few off-by-one errors in my code for testing that a tree was in order! This helped me build up the unit tests for that part of the program.

As well as allowing me to make changes to the algorithm at the heart of the program, having a function that tells me if a tree is in order or not allows me to skip overwriting the existing file if the file is already in order:

lines = get_lines(file_name)

if in_place and are_lines_sorted_tree(lines):
    print('In place sorting requested but already sorted.')
else:
    # Sorting work...

The complete listing is here:
https://github.com/robert-impey/tree-sorter/blob/master/treesorter.py

Make Turning Windows Off and On Again Great Again

There’s a joke about all Windows related problems being solvable by turning the computer off and on again. It’s one of the things that works most reliably about the OS. Well, even this has been broken by Microsoft in the Windows 10 Fall Creators Update.

When you get to the point where you have more than a dozen virtual desktops running with many different programs, your computer (not to mention your brain) will start to get overloaded. At times like these, I just want to hit refresh (to coin a phrase…) and get back to an unmussed system. However, a new “feature” in the Fall Creators Update of Windows 10 will try to reopen as many of the programs that you had running beforehand as possible. Not all programs get restarted: tiny programs like Notepad++ stay unloaded, but bloated behemoths that hang for 30 seconds to a minute while loading (like Excel and Visual Studio) get restarted, maybe with the files that you had open, maybe not. If you were using virtual desktops beforehand and had carefully separated your programs by task, all the programs will be reopened in the first desktop, but you will still have several empty and useless virtual desktops open. Instead of getting a fresh start on a clear desktop, you are left waiting for programs to load and needing to close all the virtual desktops. Unfortunately, there’s no setting to just turn off this rather half-baked behaviour.

However, a little PowerShell can reduce the pain and shut down the computer more thoroughly.

I don’t want to accidentally trigger this, so my script starts with a confirmation:

Write-Output "This will close all your virtual desktops and shutdown the computer."
$Confirm = Read-Host "Hit Y to confirm."
if ($Confirm.ToLower() -ne 'y')
{
    Write-Output "Doing nothing"
    exit
}

Write-Output "Shutting down..."

Closing all the other virtual desktops can be achieved by wrapping MScholtes’ VirtualDesktop:

while ($true)
{
    $CountCommand = "$($VirtualDesktopExe) /C"
    $CountOutput = iex $CountCommand
    if ($CountOutput -match 'Count of desktops: (\d+)')
    {
        $Count = $Matches[1]

        if ($Count -gt 1)
        {
            Write-Output "You still have $($Count) virtual desktops."
            iex "$($VirtualDesktopExe) /R"
        }
        else
        {
            Write-Output "Only one virtual desktop at this point"
            break
        }
    }
    else
    {
        Write-Error "Unable to parse result of command '$($CountCommand)': '$($CountOutput)'"
        exit
    }
}

This simply invokes the program with the “/C” flag (count) then calls the program with “/R” (remove) until only one virtual desktop remains.

After that, the script just invokes the old-fashioned shutdown command from Ramesh Srinivasan’s article about this new feature.

shutdown.exe /s /t 0

On my desktop, I have a link to a script that invokes the PowerShell script with the location of the VirtualDesktop.exe file on my machine:

PowerShell -command "F:\Robert\Dropbox\local-scripts\_Common\CloseVirtualDesktopsAndShutdown.ps1 -VirtualDesktopExe F:\Robert\Dropbox\executables\Windows\x64\VirtualDesktop.exe"

Now I know that I will get a fresh start when I turn on my computer again. I’m not really sure why this isn’t the default behaviour.

Duplicated Line Finding in Go

I’m learning to read, write and understand Go at the moment:

https://golang.org/

As part of this, I am reading “The Go Programming Language” by Donovan and Kernighan and completing the exercises in it.

In the first tutorial chapter, there are samples of code for detecting duplicated lines in text files. The technique put forward by the book is to use a map of strings to integers, where the key is the line and the value is the count. The programs simply loop over the lines and increment the value for the current line in the map.

An exercise that is left for the reader is to adapt one of the existing sample programs to read several files and print the names of the files that contain each duplicated line.

The first problem to overcome is that the names of the files have not been recorded as the input files are processed. This means that when we come to print the data that we have collected, we have lost that information.

My first approach was to add a map of strings to string slices. The idea was to add the file name to the string slice for that line as each line is read. This works in the sense that we end up with the list of file names for each line that is duplicated. However, the list of file names will itself contain duplicates if a line is duplicated within a file. I could wrap the code that appends the new file name in an “if” statement and check whether that item is already there, but checking whether a slice contains an item is O(n), which sounds like too much unnecessary work. I needed a data structure that discards duplicates so that each item is unique. One way to do this is to use the keys of a map of strings to integers to hold the values that you wish to keep unique, incrementing the map’s values as you go.

I created the following map of strings to maps of strings to ints:

lineCountsInFiles := make(map[string]map[string]int)

and incremented the values as simply as:

lineCountsInFiles[line][filename]++

One gotcha that I ran into during execution was that the nested map also needs to be initialized before we can increment the value:

if lineCountsInFiles[line] == nil {
    lineCountsInFiles[line] = make(map[string]int)
}

This caught me out because we can increment a value in a map of strings to ints without any initialization, as the int has a default value of zero that can be incremented:

counts[line]++

Once the files have been processed, we have a map from the lines in all the files to a map from the names of the files in which they appear to the counts of each line in that file.

Putting everything together looks like this:

// Prints the counts, lines and names of files where
// a line is duplicated
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    counts := make(map[string]int)
    lineCountsInFiles := make(map[string]map[string]int)

    for _, filename := range os.Args[1:] {
        f, err := os.Open(filename)
        if err != nil {
            fmt.Fprintf(os.Stderr, "Problem reading %v: %v\n", filename, err)
            continue
        }
        input := bufio.NewScanner(f)
        for input.Scan() {
            line := input.Text()
            counts[line]++
            if lineCountsInFiles[line] == nil {
                lineCountsInFiles[line] = make(map[string]int)
            }
            lineCountsInFiles[line][filename]++
        }
        f.Close()
    }

    for line, n := range counts {
        if n > 1 {
            fmt.Printf("%d, %v\n", n, line)
            for filename, count := range lineCountsInFiles[line] {
                fmt.Printf("\t%d,%v\n", count, filename)
            }
        }
    }
}

Which is saved here:

https://github.com/robert-impey/CodingExperiments/blob/master/Go/dup-with-names.go

For files like:

apples
coconuts
apples
bananas
apples
bananas

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/duplicated-fruit.txt

and

apples

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/more-duplicated-fruit.txt

The output might be something like:

C:\Users\Robert\code\CodingExperiments\Go>dup-with-names.exe duplicated-fruit.txt more-duplicated-fruit.txt
4, apples
        3,duplicated-fruit.txt
        1,more-duplicated-fruit.txt
2, bananas
        2,duplicated-fruit.txt

I write “might be” as the order that keys are retrieved from maps is deliberately undefined.
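
If deterministic output were ever needed, one option (a hypothetical variation, not part of the program above) would be to collect the duplicated lines, sort them, and replace the final loop with something like this:

// Hypothetical variation for stable output; requires importing "sort".
// The inner map of file names could be sorted in the same way.
duplicated := make([]string, 0, len(counts))
for line, n := range counts {
    if n > 1 {
        duplicated = append(duplicated, line)
    }
}
sort.Strings(duplicated)
for _, line := range duplicated {
    fmt.Printf("%d, %v\n", counts[line], line)
}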

There are probably lots of ways to solve this problem. If you have any suggestions, please let me know.

Have you celebrated Baegil yet?

A Korean friend asked my wife if our son had celebrated Baegil or 100 days yet. Not being familiar with the celebration, we said that we hadn’t kept track of the days since his birth. I decided to write a little program in Go in order to check what day of one’s life one is currently on.

The program should take the subject’s date of birth as a string from the first command-line argument. This was simple enough for me to decide against using the flag package and to check the command-line arguments directly:

if len(os.Args) == 1 {
    os.Stderr.WriteString("Please tell me your date of birth!\n")
} else {
...
}

I wanted to save programming effort and avoid reinventing the square wheel by using a library for handling the dates. The time package from the standard library can represent times and parse strings to time structs.

The time package has a peculiar (but easy to grok) way of specifying date layouts for formatting and parsing. Rather than %Y, %m, %d and so on from the standard C library and its descendants, Go uses a fixed reference date in which the month is 1, the day is 2, the hour is 3 and the year is 2006. So, to parse dates such as “2016-06-20” or “1980-10-28”, we need:

var layout = "2006-01-02"
var dateOfBirth, dOBParseError = time.Parse(layout, os.Args[1])

Full details can be found on this blog post:

https://pauladamsmith.com/blog/2011/05/go_time.html

Of course, a user of the program might not enter a parseable date string, so we need a little bit more error handling:

if dOBParseError != nil {
    fmt.Fprintf(os.Stderr, "Unable to parse your date of birth: %s\n", dOBParseError)
} else {
...
}

Once we have a date of birth, we can perform our arithmetic. The time package has a Since(...) function that can give you the duration of time since a given time struct:

var age = time.Since(dateOfBirth)

For a simple program like this, functions like Since are more convenient than getting the current time explicitly and comparing the given time to it. However, this does create an issue when testing code that does anything to do with time. If your code reads the system clock directly, your functions are impure, as the same input won’t always return the same output. How do you test that a given input produces the correct output? Such tests would only pass at specific times on a given date. The purely functional way around this is to write your business-logic functions to accept the time values they need as parameters, so that appropriate mocks can be passed in during tests without anything reading the system clock. Another approach is to override the functions that read the system clock so that they return a set time in the test; this is called monkey patching. I’ve heard of people even setting the system clock during the test, but this is likely to have unforeseen consequences, especially on a shared build server.
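
As a sketch of that first, purely functional approach (hypothetical, and not how the program below is structured), the day-counting logic can be a pure function that takes both times as parameters, so that a test can pass in a fixed “now”:

// daysBetween is a hypothetical pure helper: because "now" is a parameter,
// a test can supply a fixed time instead of reading the system clock.
func daysBetween(dateOfBirth, now time.Time) float64 {
    return math.Floor(now.Sub(dateOfBirth).Hours() / 24)
}

// Production code would call daysBetween(dateOfBirth, time.Now()),
// while a test would compare daysBetween of two fixed, parsed times
// against an expected value.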

The time package defines an Hours method for Duration values but not one for days. A quick and dirty solution to this is to divide the number of hours by 24 and floor the result:

var ageInDays = math.Floor(age.Hours() / 24)

This makes the assumption that every day has exactly 24 hours. What about leap seconds? A person might experience a few leap seconds during a lifetime. What if this program is run within a few seconds of midnight? The result might be different from the true result. As we are not using the precise time of birth and are flooring the number of days anyway, this is not too much of a concern. However, if really precise maths is needed, or you are using the components of one time struct to create a new time struct, inaccuracies can creep into your code. The fun really begins when you need to handle issues like leap years, time zones and seasonal time adjustments. If I ran this code in San Francisco in winter for someone who was born in Sydney during Daylight Saving Time, would the result be accurate? A more fully developed version of this program would probably use a library to handle the mind-boggling complexity of time tracking in the real world.

Another issue to consider is off-by-one errors. I’m flooring the number of days in order to say “days since”. If I enter today’s date, I will get 0, as the subject’s age has not yet reached one day. This might be confusing to someone expecting 1 on the first day and so on. How ages are reckoned varies between cultures; how Koreans consider their ages is surprising to many foreigners.

https://en.wikipedia.org/wiki/East_Asian_age_reckoning#Korean

Koreans celebrate Baegil on the 100th day, so we would expect the program to return 99 on the appropriate day. A programmatic solution to this might be to add 1 to the result and change the wording. Programmers need to consider the audience of their programs carefully and gather requirements fully.
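
A sketch of that adjustment (hypothetical, and not in the program below):

// Report the day of life rather than completed days:
// 1 on the date of birth, 100 on Baegil.
fmt.Printf("Day of life: %.0f\n", ageInDays+1)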

Obviously, you can’t be born in the future and still use the program, so we need a bit more user input checking:

if ageInDays < 0 {
    os.Stderr.WriteString("Wow! I must be talking to a foetus!\n")
} else {
    ...
}

If we’ve made it this far, we can now print the age in days:

fmt.Printf("Days old: %.0f\n", ageInDays)

Putting it all together, this is my program:

package main

import (
    "fmt"
    "math"
    "os"
    "time"
)

func main() {
    if len(os.Args) == 1 {
        os.Stderr.WriteString("Please tell me your date of birth!\n")
    } else {
        var layout = "2006-01-02"
        var dateOfBirth, dOBParseError = time.Parse(layout, os.Args[1])

        if dOBParseError != nil {
            fmt.Fprintf(os.Stderr, "Unable to parse your date of birth: %s\n", dOBParseError)
        } else {
            var age = time.Since(dateOfBirth)
            var ageInDays = math.Floor(age.Hours() / 24)

            if ageInDays < 0 {
                os.Stderr.WriteString("Wow! I must be talking to a foetus!\n")
            } else {
                fmt.Printf("Days old: %.0f\n", ageInDays)
            }
        }
    }
}

Entering my son’s date of birth, I see that we still have a couple of days to go:

$ ./daysOld 2016-06-20
Days old: 97

The source code can be found at:

https://github.com/robert-impey/CodingExperiments/blob/master/Go/daysOld.go

Coin Flips with Go

Following on from my little experiment with flipping coins millions of times, I thought that it would be interesting to write the same program in Go for comparison.

package main

import (
    "flag"
    "fmt"
    "math/rand"
    "time"
)

func main() {
    rand.Seed(time.Now().UTC().UnixNano())

    var competitionSize, runs int
    flag.IntVar(&competitionSize, "competitionSize", 10, "The size of each coin tossing competition.")
    flag.IntVar(&runs, "runs", 10, "The number of runs of the competition")

    flag.Parse()

    fmt.Println("heads, tails, flips")
    for i := 0; i < runs; i++ {
        countHeads := 0
        for j := 0; j < competitionSize; j++ {
            countHeads += rand.Intn(2)
        }

        fmt.Printf("%d, %d, %d\n", countHeads, competitionSize-countHeads, competitionSize)
    }
}

The basic idea of how to write the program is unchanged in this language. The output showed a similar distribution to the C# program.

However, Go has straightforward command-line argument parsing built into the standard library (as do Python and Perl). In C#, there are a number of third-party libraries for this that can be installed via NuGet. I went with Command Line Parser.

To use this, I needed to add a class whose properties are decorated with attributes describing my expected options:

class Options
{
    // Taken from http://www.bbc.co.uk/news/politics/eu_referendum/results
    [Option('s', "competitionSize", DefaultValue = 33551983, HelpText = "The size of each coin tossing competition.")]
    public int CompetitionSize { get; set; }

    [Option('r', "runs", DefaultValue = 1000, HelpText = "The number of runs of the competition")]
    public int Runs { get; set; }
}

The command line arguments are then parsed in the main method:

var commandLineArgs = new Options();
if (Parser.Default.ParseArguments(args, commandLineArgs))
{
    var runs = commandLineArgs.Runs;
    var flips = commandLineArgs.CompetitionSize;
    ...
}
else
{
    WriteLine("Unable to parse the command line arguments!");
}

While this is not difficult to understand, it is more complicated than it needs to be and requires quite a bit more typing than in the Go program.

10 runs of the referendum simulation in both programs on Windows 10 with an Intel i7 4500U running at 1.8GHz took 7 seconds for the C# program and 13 seconds for the Go program. I wouldn’t take such simple programs very seriously for drawing conclusions about the relative speed of these languages.

The updated C# program and the complete Go program are on GitHub:
https://github.com/robert-impey/CodingExperiments/blob/master/C-Sharp/CoinFlipsAreRandom/CoinFlipsAreRandom/Program.cs
https://github.com/robert-impey/CodingExperiments/blob/master/Go/coinToss.go

Was the EU Referendum Random?

One of the claims that I saw on social media in the aftermath of the recent EU referendum here in the UK was that the result (52% to 48%) was so close that it was little different from tossing a coin.

Without getting bogged down in the politics of that referendum, or the various campaigns that led up to it, I want to consider whether this claim holds any water. How similar was the result to millions of people each tossing a coin and voting accordingly?

According to the BBC, there were 17,410,742 votes to leave and 16,141,241 votes to remain, giving a total of 33,551,983 votes. If we were to make each of these people toss a coin and count up the results, the ratio of heads/tails or remains/leaves could be anywhere between all heads and all tails. However, we would expect the counts to be about equal if the coins were all fair. Of course, any ratio is possible, but if we were to run the coin tossing game repeatedly, we would expect the mean ratio to converge on 1:1. How likely would a 52:48 ratio be?

The leave side got a share of the vote equal to 0.51891842. Therefore, their absolute deviation from the expectation (the mean, or 0.5) is 0.01891842. The difference between the two counts is 1,269,501. Would we expect to see a deviation of this magnitude in a coin tossing competition?

Being a computer programmer rather than a mathematician, I’m going to look at this using a simple program.

WriteLine("heads, tails, flips, heads share");

var runs = 1000;

// Taken from http://www.bbc.co.uk/news/politics/eu_referendum/results
var flips = 33551983;

var randomNumberGenerator = new Random();

for (var run = 0; run < runs; run++)
{
    var heads = 0;

    for (var coinToss = 0; coinToss < flips; coinToss++)
    {
        if (randomNumberGenerator.Next(2) > 0)
        {
            heads++;
        }
    }

    WriteLine($"{heads}, {flips - heads}, {flips}, {(0.0 + heads) / flips}");
}

This program simulates 1,000 coin tossing competitions with 33,551,983 players and writes the counts as comma separated values.

Putting the output into Excel, the largest deviation was 0.000275081 (or a difference between counts of 18,459) and the smallest was 1.49022E-08 (which was 16,775,992 heads and 16,775,991 tails. This happened twice in 1,000 runs!) The largest difference between heads and tails was 69 times smaller than the result from the referendum.

Plotting the shares in decreasing order, we see how quickly larger deviations fall off:

[Chart: deviations plotted in decreasing order]

Putting the deviations into bins and counting the competitions by deviation from the expectation, we see that smaller deviations are more common:

[Chart: count of competitions binned by deviation from the expectation]

Whatever else we might say about the result, we cannot seriously claim that the result was random.
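
As a rough analytic cross-check (a back-of-the-envelope calculation rather than part of the simulation above), the binomial distribution tells the same story: for n fair coin flips, the number of heads has a standard deviation of sqrt(n)/2, which is only about 2,896 here, and the referendum result sits hundreds of standard deviations away from an even split.

package main

import (
    "fmt"
    "math"
)

func main() {
    n := 33551983.0        // total votes cast
    leave := 17410742.0    // votes to leave
    sd := math.Sqrt(n) / 2 // standard deviation of the heads count for n fair flips, about 2,896
    surplus := leave - n/2 // excess of leave votes over an even split, about 634,750
    fmt.Printf("%.0f standard deviations from an even split\n", surplus/sd) // about 219
}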

The full program can be found here:

https://github.com/robert-impey/CodingExperiments/blob/master/C-Sharp/CoinFlipsAreRandom/CoinFlipsAreRandom/Program.cs

The Excel file can be found here:
http://www.reversing-entropy.com/wp-content/uploads/2016/09/EuRef.xlsx

Does Declarative Programming generally help?

A recent post compared different ways of approaching the same problem in Scala and Go:

https://www.quora.com/Scala-vs-Go-Could-people-help-compare-contrast-these-on-relative-merits-demerits/answer/Nick-Snyder-1?srid=dsGW&share=191eaf13

The Go approach is longer than the one in Scala, but every line is dead simple to understand. We might dismiss it as repetitive boilerplate code rather than abstract business logic. However, setting breakpoints for debugging or changing the behaviour of the code when new requirements arrive, six months down the line, is trivial. This can be a mixed blessing, as there’s really nothing to stop code like that ballooning in size as new conditional statements get added to cope with the changing needs of the software. Every 5 kLoC ball of mud that any programmer ever wrestled with started out as something dead simple like that.

The following article investigates how non-programmers approach programming problems and whether programming languages make the task harder because they do not match how an untrained mind thinks.

http://alumni.cs.ucr.edu/~ratana/PaneRatanamahatanaMyers00.pdf

More children developed rules to describe the behaviour of Pacman than instructions for the sprites to follow. People apparently start out thinking in a more declarative style and then develop the habit of thinking in the imperative style.

Is this an argument for using more declarative languages? If declarative programming is more natural, surely it would be easier. However, many programmers start out with imperative code and perhaps only try declarative code later. Is this just a historical or social accident caused by the greater availability of imperative languages compared to declarative ones? It is possible that this is a greater issue for novice programmers than for experienced ones.

Often, we cannot move between languages on a whim. Few programmers have the privilege of being able to write part of their system in Go in the morning, then head over to Scala land after lunch. However, .NET has Linq, which affords a C# developer, like your author, the freedom to move back and forth between imperative and declarative styles of programming.

Imagine a book stall in a market. On the table are books of different lengths, with covers that are either red or blue.

The types in this world are an enum for the colours and a class for the book:

enum Colour
{
    Red,
    Blue
}

class Book
{
    internal Book(string name, int numberOfPages, Colour colour)
    {
        Name = name;
        NumberOfPages = numberOfPages;
        Colour = colour;
    }

    internal string Name { get; }
    internal int NumberOfPages { get; }
    internal Colour Colour { get; }
}

We can model our bookstall with a list:

var books = new List<Book>
{
    new Book ("War and Peace", 1123, Colour.Red),
    new Book ("Doctor Zhivago", 803, Colour.Blue),
    new Book ("Jeeves and the Feudal Spirit", 154, Colour.Red),
    new Book ("Shogun", 1213, Colour.Red)
};

You want to write down the names of the books that are more than 500 pages long and have red covers. Here are three ways to solve this problem.

The imperative approach:

foreach (var book in books)
{
    if (book.Colour == Colour.Red && book.NumberOfPages > 500)
    {
        WriteLine(book.Name);
    }
}

This way, we are putting ourselves in the role of the person actually sifting through the books one by one and writing the names of the books that meet our criteria as we go. However, the presentation code (the WriteLine) is mixed up with the filtering logic in the loop.

Linq makes use of the Where extension method to allow the so-called fluent syntax:

var longRedBooks = books.Where(b => b.Colour == Colour.Red && b.NumberOfPages > 500);

foreach (var longRedBook in longRedBooks)
{
    WriteLine(longRedBook.Name);
}

Here, we first ask for the books that match our criteria, defined in the lambda expression passed to the Where method. Then, we print the books that matched. The filtering logic and presentation logic are now separate.

Using Linq’s query syntax below, the code looks quite different from the fluent syntax used above, but the meaning is the same. This approach is perhaps more familiar to developers with a background in SQL.

var longRedBooks = from book in books
                   where book.Colour == Colour.Red && book.NumberOfPages > 500
                   select book;

foreach (var longRedBook in longRedBooks)
{
    WriteLine(longRedBook.Name);
}

The complete source can be found on GitHub:
https://github.com/robert-impey/CodingExperiments/blob/master/C-Sharp/BookFinding/Program.cs

Which is easiest to understand? If the requirements change (let’s say that we now only want one book: the longest book shorter than 500 pages with a blue cover), which approach will be easiest to update and understand with the new logic? With the Linq approach, we can take advantage of the type system to see that the query now returns a Book (which may be null) rather than an IEnumerable<Book> and handle the object appropriately. Will the change in meaning be as obvious in the imperative code?

For what it’s worth, I’ve experienced plenty of resistance in code reviews to Linq queries, but almost none to writing the same thing in a for loop. Sometimes the reasoning is to do with performance, but I doubt there’s much in that; in any case, a read-only code review is a terrible place to make assumptions about relative performance. It is very easy to reason about the performance of the for loop, whereas more advanced knowledge of lazy evaluation, the iterator pattern and its implementation is needed to understand what gets executed where, and how many times, with the Linq approaches.

I would be very interested to see an analysis of open source projects that compared the frequency of these approaches to see which one people tend to settle on. Do pull requests with Linq get accepted more or less often? Is code generally refactored from one style to the other? In which direction?

How should this influence job interviews and hiring? If we want to hire experienced programmers, we could check for signs that their thinking has been moulded to the structures of industrial programming languages. Perhaps we should pity such creatures. Should we look for signs of skills in imperative programming in hands-on programmers and for a declarative approach in tech leads and more senior roles, where coordinating the efforts of several developers is the required skill?