15-118 Lecture 8 (Monday, July 15, 2013)

15-112 Lecture 8 (Monday, July 15, 2013)

File Input

Sometimes we want to read data from files or write data to files. In practice, this can become very tangled. But, fundamentally, it is simple.
Let's create a file called "input.txt", as below:
  one
  1
  2
  two
  
And, given "input.txt", let's take a look at the example below:
#!/usr/bin/python

inputfile = open ("input.txt")

line = inputfile.readline()
print line                    # No comma here, so the output is double spaced
line = inputfile.readline()
print line                    # (or here)

line = inputfile.readline()
print line,                   # NOTICE THE COMMA HERE
line = inputfile.readline()
print line,                   # AND THE COMMA HERE

# Done with this use
inputfile.close()

# Re-open it
print "After reopen -- getting all the lines at once"
inputfile = open ("input.txt")
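lines = inputfile.readlines()
print lines[0],               # Commas again, so the lines are not double spaced
print lines[1],
print lines[2],
print lines[3],

# Done with the file again
inputfile.close()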
  
And, pay attention to its output:
  one

  1

  2
  two
  After reopen -- getting all the lines at once
  one
  1
  2
  two
  
We can also read all of the lines of the file into a List of lines using readlines(), e.g. "lines = inputfile.readlines()".
What are the take-aways?

open() opens the file with the name we provide and gives us back a "file object" that represents it.
We can use the readline() "method", which is like a function associated with an "object", to read a line from the file represented by the file object.
The "dot operator" . associates a method with an object. In this way, the object performing the readline() is clearly denoted.
When we print a line we have read from a file, the output is double spaced. This is because print automatically adds a newline, and the line we read itself ends in a newline, so we end up printing both. Placing a comma at the end of the print statement tells print not to add its newline -- so we only get one, the one at the end of the line we read. The comma syntax might seem strange to you, but it makes some sense: it was developed to make it easy to print multiple items on the same line, and listing multiple items separated by commas, such as "a, b, c, d", is a very common, human notation.
readlines() reads an entire file into a List of strings, one line per string.
close() indicates that we are done with the file object. Once closed, it can no longer be used to read from the file, and it no longer keeps track of our position within it, e.g. how much we've read.
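
The take-aways above can be checked with a short sketch. To keep it self-contained, it first writes its own scratch copy of "input.txt" into a temporary directory; the tempfile usage and the variable names are assumptions made for illustration, not something from lecture. It uses print(...) with a single argument, which behaves the same under Python 2 and Python 3.

```python
import os
import tempfile

# Create a scratch file with the same four lines as "input.txt"
# (the temporary directory is an assumption, so the sketch stands alone)
path = os.path.join(tempfile.mkdtemp(), "input.txt")
scratch = open(path, "w")
scratch.write("one\n1\n2\ntwo\n")
scratch.close()

inputfile = open(path)
first = inputfile.readline()    # Includes the trailing newline: "one\n"
rest = inputfile.readlines()    # Continues from the current position
inputfile.close()

print(repr(first))              # repr() makes the newline visible
print(repr(rest))

# Removing the trailing newline leaves print's newline as the only one
for line in [first] + rest:
    print(line.rstrip("\n"))
```

Note that readlines() picks up where readline() left off, because the file object keeps track of our position until it is closed.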

File Output
Writing to a file is very similar to reading from a file. Notice, though, the added argument to "open", specifically, the "w". We are now giving open the name of the file -- and asking it to allow us to write to it.
Notice also the "\n". This is called the "newline escape character". It is a way of asking the system to insert a new line, like hitting enter, at that point.
#!/usr/bin/python

outfile = open ("outfile.txt", "w")
outfile.write("Hello world!\n")
outfile.write("Hello great, wonderful world!\n")
outfile.close()
  
The file we created is shown below. We can view it, for example, via "more outfile.txt" from the command prompt.
Hello world!
Hello great, wonderful world!
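
One detail worth seeing for yourself: write() adds nothing on its own, so a line only ends where we put a "\n". The sketch below is a small illustration of that, writing to a scratch location (an assumption, so that it does not overwrite a real "outfile.txt") and then reading the file back.

```python
import os
import tempfile

# Scratch location (an assumption, so the sketch does not touch real files)
path = os.path.join(tempfile.mkdtemp(), "outfile.txt")

outfile = open(path, "w")       # "w" creates the file, or empties an existing one
outfile.write("Hello ")         # No "\n", so the next write continues this line
outfile.write("world!\n")
outfile.write("Hello great, wonderful world!\n")
outfile.close()

# Read the file back to confirm what was written
infile = open(path)
contents = infile.readlines()
infile.close()

print(repr(contents))
```

The first two calls to write() end up on the same line of the file, since only the second one supplies a newline.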
  
Stripping White Space
People tend to be very insensitive to white space in strings -- we just don't notice it very easily. As a result, when processing strings entered by humans, we often want to strip out the extra white space (for example, spaces, tabs, and newlines), leaving the rest of the string. There are three methods of help to us:

lstrip() -- strips leading white space, i.e. that on the left side
rstrip() -- strips trailing white space, i.e. that on the right side
strip() -- strips leading and trailing white space, but leaves white space in the middle

Please consider the example below:
#!/usr/bin/python

spacedPhrase = "   Greetings and      Welcome     "

print "phrase: ---" + spacedPhrase + "---"
print "strip: ---" + spacedPhrase.strip() + "---"
print "lstrip: ---" + spacedPhrase.lstrip() + "---"
print "rstrip: ---" + spacedPhrase.rstrip() + "---"
print "lstrip + rstrip: ---" + spacedPhrase.lstrip().rstrip() + "---"
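
As a further sketch, the three methods also remove tabs and newlines, not only spaces, which is handy for lines read from a file. The string below is made up for illustration, and repr() is used so the white space that remains is visible. The print(...) form works under Python 2 and Python 3.

```python
raw = "\t  Greetings and   Welcome \n"

both = raw.strip()
left = raw.lstrip()
right = raw.rstrip()

print(repr(both))     # 'Greetings and   Welcome'
print(repr(left))     # 'Greetings and   Welcome \n'
print(repr(right))    # '\t  Greetings and   Welcome'

# Stripping each side in turn gives the same result as strip()
print(raw.lstrip().rstrip() == both)    # True
```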
More String Methods and Functions
Before writing any function to manipulate a string, see if the Python libraries already provide it. Strings, themselves, are very rich objects and have many methods for manipulating themselves. And, beyond that, there are a few additional string functions.
Back in "The Day", there were only functions to manipulate strings, because in days gone by, strings were more like simple data types than objects. But, these days strings are rich objects. In many cases, you'll find both the old functions and the new methods that do essentially the same thing. In these cases, it is considered proper form to use the method, not the function. But, in some cases, there is no equivalent method, in which case the function is the only way to go -- and perfectly fine.
In class we perused the official documentation. You should do the same as you study -- and again any time you need a quick reference:

Python Standard Library Strings
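
To make the method/function distinction concrete, here is a small sketch. The sample string and variable names are made up for illustration. The old-style functions from the string module are left out, since they are only available in Python 2; everything below runs under Python 2 and Python 3.

```python
phrase = "Hello, World"

# Methods: called on the string itself, using the dot operator
upper = phrase.upper()              # "HELLO, WORLD"
parts = phrase.split(", ")          # ["Hello", "World"]
position = phrase.find("World")     # 7, the index where "World" begins

# Functions: strings have no length() method, so len() is the way to go
length = len(phrase)                # 12
ordered = sorted("cab")             # ['a', 'b', 'c'], a List of characters

print(upper)
print(parts)
print(position)
print(length)
print(ordered)
```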

Inverted Index: Nested Collections Example
Today we built a data structure known as an Inverted Index. It is an index for quickly finding the locations of words in text files. We implemented it as a Dictionary that maps each word in the file to a List of the numbers of the lines in which the word appears.
Dictionaries are the logical choice for mapping words to lists. For the lists, themselves, Tuples would have been a poor choice, because the list needs to change as we are building it. We could have used Sets, but then we'd have to sort the line numbers, since they would not be maintained in order, internally, by the Set. As a result, Lists are the correct choice for this application.
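
Before reading the full program, it may help to see the shape of the finished data structure on a tiny input. The sketch below is only an illustration: the three sample lines are made up, there is no word cleaning, and it uses the dictionary's setdefault() method in place of the try/except approach from class.

```python
lines = ["the cat sat", "the dog", "a cat"]

index = {}
lineNumber = 0
for line in lines:
    for word in line.split():
        # setdefault() returns the word's existing List, or stores and
        # returns a new empty List if the word has not been seen before
        index.setdefault(word, []).append(lineNumber)
    lineNumber += 1

print(index["the"])             # [0, 1]
print(index["cat"])             # [0, 2]
print(sorted(index.keys()))     # ['a', 'cat', 'dog', 'sat', 'the']
```

Because lines are processed in order, each List of line numbers comes out already sorted, which is the advantage over Sets mentioned above.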

invertedIndexExample.py
#!/usr/bin/python

# This function removes punctuation, etc, from words
# This is done to make sure that "hello," or "hello." are seen as "hello"
# This was mostly an exercise in using string library methods for practice
def cleanWord(word):
  
  # Get rid of surrounding spaces and any dashes, convert to upper case as a canonical form
  word = word.strip().upper().replace("-", "")

  # What about punctuation in the middle?
  # We'll index only the first part

  # Find the first alphanumeric character
  begin = 0
  while (begin < len(word)):
    if word[begin:begin+1].isalnum():
      break
    begin += 1

  # Find the end of that first run of alphanumeric characters
  end = begin
  while (end < len(word)):
    if not (word[end:end+1].isalnum()):
      break
    end += 1

  # Keep only that slice
  word = word[begin:end]
  return word


# Gets a List containing each line of the file
# (the parameter is named fileName to avoid hiding Python's built-in "file")
def getFileLines(fileName):
  inputFile = open(fileName, "r")

  lines = inputFile.readlines()

  inputFile.close()

  return lines


# This actually builds the Inverted Index
# It returns a Dictionary mapping word-->List[line numbers]
def buildIndex(lines):
  index = {} # Create an empty Dictionary
  lineNumber  = 0 # Number the first line 0

  # For each line, we know its number, walk through words adding to index
  for line in lines:

    # Strip leading and trailing space from line and split into list of words
    # split() with no argument splits on any run of white space
    line = line.strip()
    words = line.split()

    # For each of those words
    # 1. Clean it to make it upper-case and w/out attached punctuation
    # 2. Add its line number to the index
    # 2a. Note that, if we haven't seen the word before, the dictionary
    #     has no List for it yet, and looking it up raises a KeyError.
    #     So, we need to map a list, then we can add the line number
    #     to the list (or, alternately, map a list with just the
    #     line number). Remember, we are mapping the word to a List.
    for word in words:
      word = cleanWord(word)
      if word == "":
        continue # Nothing left after cleaning, e.g. it was all punctuation
      try:
        index[word].append(lineNumber)
      except KeyError:
        index[word] = []
        index[word].append(lineNumber)
        # Could also do in one line, as below:
        # index[word] = [lineNumber]

    lineNumber +=1 # Get ready for the next line, which has the next number

  # Processed all words of all lines -- return the complete index (Dictionary)
  return index


# This function uses our index
# Given a word, it looks it up in the index
# Then it either prints a message telling the user it isn't in the document
# Or, it uses the index to get the list of lines, and then prints each
# one via the list of lines we originally created and used to build the index
def printMatchingLines(lines, index, word):

  # Canonicalize the word, as we did to create the index
  word = cleanWord(word)

  try:
    lineNumbers = index[word] # This will raise a KeyError, if not there
  except KeyError:
    print word + " was NOT found (sowwy!)." # So we tell the user
    return

  # Print the matching lines
  print word + " was found as below..."
  for lineNumber in lineNumbers:
    print str(lineNumber) + ": " + lines[lineNumber],
  print ""
  

# Index the Declaration of Independence
lines = getFileLines("declaration.txt")
index = buildIndex(lines)

# Look up some words, print matches (and no-match message)
printMatchingLines(lines, index, "friends")
printMatchingLines(lines, index, "byeByeBrits")