I’m learning to read, write and understand Go at the moment:
As part of this, I am reading “The Go Programming Language” by Donovan and Kernighan and completing the exercises in it.
In the first tutorial chapter, there are samples of code for detecting duplicated lines in text files. The technique put forward by book is to use a map of strings to integers, where the key is the line and the value is the count. The programs simply loop over the lines and increment the value for the current line in the map.
An exercise that is left for the reader is adapt one of the existing sample programs to read several files and print the names of the files that contain each duplicated line.
The first problem to overcome is that the names of the files have not be recorded as the input files are being processed. This means that when we come to print the data that we have collected, we have lost that information.
My first approach was to add a map of strings to string slices. The idea was to add the file name to the string slice for that line as each line is read. This works in the sense that we have the list of file names for each line that is duplicated. However, the list of file names will contain duplicates itself if a line is duplicated in a file. I could wrap the code to append the new file name in an “if” statement and check whether that item is already there. However, checking whether a slice contains an item is O(n), which sounds like too much unnecessary work. I needed data structure that discards duplicates so that each item is unique. One way to do this is to use the keys of a map of strings to integers as the values that you wish to keep unique and increment the values of the map.
I created the following map of strings to maps of strings to ints:
lineCountsInFiles := make(map[string]map[string]int)
and incremented the values as simply as:
lineCountsInFiles[line][filename]++
One gotcha that I ran into during execution was that the nested map also needs to be initialized before we can increment the value:
if lineCountsInFiles[line] == nil { lineCountsInFiles[line] = make(map[string]int) }
This caught me because we can increment a value in a map of strings to ints without any initialization as the int has default value that be incremented:
counts[line]++
Once the files have been processed, we have a map from the lines in all the files to a map from the names of the files in which they appear to the counts of each line in that file.
Putting everything together looks like this:
// Prints the counts, lines and names of files where // a line is duplicated package main import ( "bufio" "fmt" "os" ) func main() { counts := make(map[string]int) lineCountsInFiles := make(map[string]map[string]int) for _, filename := range os.Args[1:] { f, err := os.Open(filename) if err != nil { fmt.Fprintf(os.Stderr, "Problem reading %v: %v\n", filename, err) continue } input := bufio.NewScanner(f) for input.Scan() { line := input.Text() counts[line]++ if lineCountsInFiles[line] == nil { lineCountsInFiles[line] = make(map[string]int) } lineCountsInFiles[line][filename]++ } f.Close() } for line, n := range counts { if n > 1 { fmt.Printf("%d, %v\n", n, line) for filename, count := range lineCountsInFiles[line] { fmt.Printf("\t%d,%v\n", count, filename) } } } }
Which is saved here:
https://github.com/robert-impey/CodingExperiments/blob/master/Go/dup-with-names.go
For files like:
apples coconuts apples bananas apples bananas
https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/duplicated-fruit.txt
and
apples
https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/more-duplicated-fruit.txt
The output might be something like:
C:\Users\Robert\code\CodingExperiments\Go>dup-with-names.exe duplicated-fruit.txt more-duplicated-fruit.txt 4, apples 3,duplicated-fruit.txt 1,more-duplicated-fruit.txt 2, bananas 2,duplicated-fruit.txt
I write “might be” as the order that keys are retrieved from maps is deliberately undefined.
There are probably lots of ways to solve this problem. If you have any suggestions, please let me know.
How do you feed Stdin to this program. I’ve tried piping it, or just running the program and then typing input, but it never terminates on its own. If I Ctl + C to end it, nothing happens either. I’ve got it working with files, but can’t seem to get it to work with Stdin.
Thanks in advance!
LikeLike
Sort of figured it out. I can pipe a file’s contents to this script like this:
cat test.txt | go run main.go
…but still not sure how to directly process text entered via Stdin
LikeLike
Hi Andrew,
Thanks for your comment.
Part of the problem that I was trying to solve with this program was to find duplicated lines across more than one file and print the counts per file. That rules out using the STDIN, as I would not have the file name at that point.
My solution was to pass file names as inputs to the program via os.Args and read the contents:
at
https://github.com/robert-impey/CodingExperiments/blob/5d7a5b9dbf86210312a98b7f18a795be839e354c/Go/src/dup-with-names/dup-with-names.go#L15
To run that in bash, I was able to run:
or in PowerShell:
I hope that helps!
LikeLike