Python tips – List iteration and item removal
Occasionally my code needs to work through a list and delete unwanted items. Take, for example, appending the data from two CSV files to create a new file containing both sets of data. The files may not have exactly the same columns, but the common columns are the only ones you are interested in anyway. This is slightly more complex than simply ‘cat’ing the source files into the target file, so let’s assume you want to knock together a quick Python script to do it. I will first present the easy but wrong way to do it and then show how to adjust it to be correct.
The incorrect way
The first step is to compare the header rows from each source file to figure out which columns are common (this will also form the basis of the new header row to write to the target file). That suggests code such as the following:
if not combined_header:
    # First file: start from a copy of its header row
    combined_header = this_header[:]
else:
    # Later files: drop any column that this file lacks
    for col in combined_header:
        if col not in this_header:
            combined_header.remove(col)
This could of course be achieved using intersections of sets. However, that wouldn’t preserve the ordering of the columns, which I’m going to assume here is important or at least useful in some way.
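To illustrate the trade-off, here is a small sketch. It reuses the variable names from above; `lookup`, `common_set` and `common_ordered` are names I have made up for the example, and the list-comprehension filter is shown only for comparison, not as the approach this post goes on to use. A set intersection finds the right members but makes no promise about their order, whereas filtering the first list keeps its original order.

```python
combined_header = ['a', 'b', 'c', 'd', 'e']
this_header = ['e', 'b', 'a']

# Set intersection: correct members, but iteration order is not guaranteed.
common_set = set(combined_header) & set(this_header)

# Order-preserving filter: keeps the column order of combined_header.
lookup = set(this_header)
common_ordered = [col for col in combined_header if col in lookup]

print(sorted(common_set))  # sorted only to make the output predictable
print(common_ordered)      # ['a', 'b', 'e']
```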
Ok, so let’s test this. Assume the code has already looked at the two source files (we set up the values manually here), giving:
>>> combined_header = ['a', 'b', 'c', 'd', 'e']
>>> this_header = ['a', 'b', 'd', 'e']
>>> # (run the code above)
>>> print(combined_header)
['a', 'b', 'd', 'e']
Ok, so far so good. How about another test case?
>>> combined_header = ['a', 'b', 'c', 'd', 'e']
>>> this_header = ['a', 'b', 'e']
>>> # (run the code above)
>>> print(combined_header)
['a', 'b', 'd', 'e']
Wait a minute, wasn’t it supposed to remove ‘d’ from the list? If we add a print(col) call as the first statement inside the for loop, we can see why:
a
b
c
e
It never even looks at ‘d’ to check whether it should be there. The reason is that the iterator scanning the list tracks its position by index and takes no account of the removal. When ‘c’ is removed from index 2, every later item shifts down one place, so ‘d’ moves into index 2, which has already been visited. As far as the iterator is concerned, the next item to look at is still index 3, which is now ‘e’.
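The skipping can be made visible by spelling out the index bookkeeping by hand. The sketch below is only a rough imitation of what the list iterator does in CPython, and the function name `remove_while_iterating` is made up for this example.

```python
def remove_while_iterating(items, keep):
    """Imitate the buggy forward loop; return (visited, result)."""
    items = list(items)  # work on a copy so the caller's list is untouched
    visited = []
    index = 0            # the list iterator keeps an internal index like this
    while index < len(items):
        col = items[index]
        visited.append(col)
        if col not in keep:
            items.remove(col)  # later items shift down by one position...
        index += 1             # ...but the index still moves forward
    return visited, items


visited, result = remove_while_iterating(['a', 'b', 'c', 'd', 'e'],
                                         ['a', 'b', 'e'])
print(visited)  # ['a', 'b', 'c', 'e']
print(result)   # ['a', 'b', 'd', 'e']
```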
The correct way
The fix, as it turns out, is easy. After all, this is Python we’re talking about. Simply scan the list in reverse order instead, so that removing an item only shifts items that have already been visited:
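if not combined_header:
    # First file: start from a copy of its header row
    combined_header = this_header[:]
else:
    # Scan backwards so a removal only shifts already-visited items
    for col in reversed(combined_header):
        print(col)
        if col not in this_header:
            combined_header.remove(col)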
This results in:
>>> combined_header = ['a', 'b', 'c', 'd', 'e']
>>> this_header = ['a', 'b', 'e']
>>> # (run the code above)
e
d
c
b
a
>>> print(combined_header)
['a', 'b', 'e']
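If scanning in reverse feels too subtle, another option is to loop over a shallow copy of the list while removing from the original. This is a sketch of an alternative, not part of the script above. The helper name `prune_to_common` and the sample `headers` list are hypothetical, standing in for header rows read from three source files.

```python
def prune_to_common(combined_header, this_header):
    """Remove, in place, every column of combined_header missing from this_header."""
    for col in combined_header[:]:       # iterate over a shallow copy
        if col not in this_header:
            combined_header.remove(col)  # safe: the copy is what we are scanning


# Hypothetical header rows from three source files.
headers = [
    ['a', 'b', 'c', 'd', 'e'],
    ['a', 'b', 'd', 'e'],
    ['a', 'b', 'e'],
]

combined_header = headers[0][:]          # start from a copy of the first header
for this_header in headers[1:]:
    prune_to_common(combined_header, this_header)

print(combined_header)  # ['a', 'b', 'e']
```

The copy costs a little extra memory, which hardly matters for a header row, and the loop reads in the natural left-to-right order.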
Job done, one less annoying bug to track down later.