Duplicated Line Finding in Go

I’m learning to read, write and understand Go at the moment:

https://golang.org/

As part of this, I am reading “The Go Programming Language” by Donovan and Kernighan and completing the exercises in it.

In the first tutorial chapter, there are samples of code for detecting duplicated lines in text files. The technique put forward by book is to use a map of strings to integers, where the key is the line and the value is the count. The programs simply loop over the lines and increment the value for the current line in the map.

An exercise that is left for the reader is adapt one of the existing sample programs to read several files and print the names of the files that contain each duplicated line.

The first problem to overcome is that the names of the files have not be recorded as the input files are being processed. This means that when we come to print the data that we have collected, we have lost that information.

My first approach was to add a map of strings to string slices. The idea was to add the file name to the string slice for that line as each line is read. This works in the sense that we have the list of file names for each line that is duplicated. However, the list of file names will contain duplicates itself if a line is duplicated in a file. I could wrap the code to append the new file name in an “if” statement and check whether that item is already there. However, checking whether a slice contains an item is O(n), which sounds like too much unnecessary work. I needed data structure that discards duplicates so that each item is unique. One way to do this is to use the keys of a map of strings to integers as the values that you wish to keep unique and increment the values of the map.

I created the following map of strings to maps of strings to ints:

lineCountsInFiles := make(map[string]map[string]int)

and incremented the values as simply as:

lineCountsInFiles[line][filename]++

One gotcha that I ran into during execution was that the nested map also needs to be initialized before we can increment the value:

if lineCountsInFiles[line] == nil {
    lineCountsInFiles[line] = make(map[string]int)
}

This caught me because we can increment a value in a map of strings to ints without any initialization as the int has default value that be incremented:

counts[line]++

Once the files have been processed, we have a map from the lines in all the files to a map from the names of the files in which they appear to the counts of each line in that file.

Putting everything together looks like this:

// Prints the counts, lines and names of files where
// a line is duplicated
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    counts := make(map[string]int)
    lineCountsInFiles := make(map[string]map[string]int)

    for _, filename := range os.Args[1:] {
        f, err := os.Open(filename)
        if err != nil {
            fmt.Fprintf(os.Stderr, "Problem reading %v: %v\n", filename, err)
            continue
        }
        input := bufio.NewScanner(f)
        for input.Scan() {
            line := input.Text()
            counts[line]++
            if lineCountsInFiles[line] == nil {
                lineCountsInFiles[line] = make(map[string]int)
            }
            lineCountsInFiles[line][filename]++
        }
        f.Close()
    }

    for line, n := range counts {
        if n > 1 {
            fmt.Printf("%d, %v\n", n, line)
            for filename, count := range lineCountsInFiles[line] {
                fmt.Printf("\t%d,%v\n", count, filename)
            }
        }
    }
}

Which is saved here:

https://github.com/robert-impey/CodingExperiments/blob/master/Go/dup-with-names.go

For files like:

apples
coconuts
apples
bananas
apples
bananas

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/duplicated-fruit.txt

and

apples

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/more-duplicated-fruit.txt

The output might be something like:

C:\Users\Robert\code\CodingExperiments\Go>dup-with-names.exe duplicated-fruit.txt more-duplicated-fruit.txt
4, apples
        3,duplicated-fruit.txt
        1,more-duplicated-fruit.txt
2, bananas
        2,duplicated-fruit.txt

I write “might be” as the order that keys are retrieved from maps is deliberately undefined.

There are probably lots of ways to solve this problem. If you have any suggestions, please let me know.

3 thoughts on “Duplicated Line Finding in Go

  1. Andrew February 2, 2020 / 1:24 am

    How do you feed Stdin to this program. I’ve tried piping it, or just running the program and then typing input, but it never terminates on its own. If I Ctl + C to end it, nothing happens either. I’ve got it working with files, but can’t seem to get it to work with Stdin.

    Thanks in advance!

    Like

  2. Andrew February 2, 2020 / 1:40 am

    Sort of figured it out. I can pipe a file’s contents to this script like this:

    cat test.txt | go run main.go

    …but still not sure how to directly process text entered via Stdin

    Like

  3. Robert Impey February 8, 2020 / 9:32 am

    Hi Andrew,

    Thanks for your comment.

    Part of the problem that I was trying to solve with this program was to find duplicated lines across more than one file and print the counts per file. That rules out using the STDIN, as I would not have the file name at that point.

    My solution was to pass file names as inputs to the program via os.Args and read the contents:

    for _, filename := range os.Args[1:] {
        f, err := os.Open(filename) 
        ...
    

    at

    https://github.com/robert-impey/CodingExperiments/blob/5d7a5b9dbf86210312a98b7f18a795be839e354c/Go/src/dup-with-names/dup-with-names.go#L15

    To run that in bash, I was able to run:

    rober@CYNANE MINGW64 ~/code/CodingExperiments/Go/src/dup-with-names (master)
    $ go run dup-with-names.go ../duplicated-fruit.txt ../more-duplicated-fruit.txt
    4, apples
            3,../duplicated-fruit.txt
            1,../more-duplicated-fruit.txt
    2, bananas
            2,../duplicated-fruit.txt

    or in PowerShell:

    PS C:\Users\rober\code\CodingExperiments\Go\src\dup-with-names> go run .\dup-with-names.go ..\more-duplicated-fruit.txt ..\duplicated-fruit.txt
    2, bananas
            2,..\duplicated-fruit.txt
    4, apples
            3,..\duplicated-fruit.txt
            1,..\more-duplicated-fruit.txt

    I hope that helps!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s