python parsing question

postmark

Senior member
May 17, 2011
307
0
0
What is the best way to parse a file that has this type of format into lists in Python?


Code:
Label1 = "c:\temp1;" +
                     "c:\temp2;" +
                     "c:\temp3;" +
                     "c:\temp4";

Label2 = "c:\label2_1;" +
             "c:\label2_2;" +
             "c:\label2_3";

I would like to return two lists, label1 = ['c:\\temp1','c:\\temp2','c:\\temp3','c:\\temp4'] and label2 = ['c:\\label2_1','c:\\label2_2','c:\\label2_3']

Thanks!
 

purbeast0

No Lifer
Sep 13, 2001
52,859
5,732
126
lol this screams homework. at least show us you are even trying to solve this yourself with what you've come up with thus far. but i'd guess an attempt hasn't even been made yet.
 

postmark

Not homework, this is for work. My initial thought is to find the labels and then split on the quotes, but I was having a hard time figuring out where to stop.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,250
3,845
75
I'm slightly confused, since the code you posted isn't valid Python. So you're getting this out of another file?
 

postmark

Correct. This is an input file, and I wanted an easy way to pull the contents of all those directories into one directory. So I was going to use a Python script to parse it and copy all the data from those directories into one.
 

Ken g6

re.search seems like a better option than split. Use one pattern for each kind of thing you want to search for. Each time you find an assignment, create a new list. Each time you find a path, append it to the current list.
 

postmark

OK, I realized I only needed 1 list in the end, so I got it working with this, but I was hoping for something more elegant... But this is good for now.

Code:
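paths = []
with open('c:\\temp\\test.txt', 'r') as fin:
    flag = False  # True while inside a "Label = ..." statement
    for line in fin:
        # The sample file capitalizes its labels ("Label1"), so compare case-insensitively.
        if line.lstrip().lower().startswith('label'):
            flag = True
        if flag:
            # A ';' at the very end of the line (outside the quotes) ends the statement.
            if line.rstrip().endswith(';'):
                flag = False  # was `flag == False`, a comparison that changed nothing
            if '"' in line:
                sp = line.strip().split('"')
                if sp[1] != '':
                    # Drop the ';' separator kept inside the quotes.
                    paths.append(sp[1].rstrip(';+" ').lstrip())

print(paths)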
 

purbeast0

I've never done Python, but this seems like it is easily solvable with regular expressions. Does Python support regular expressions?
 

Childs

Lifer
Jul 9, 2000
11,450
7
81
I suck at regular expressions but maybe this can get you most of the way:

Code:
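#!/usr/bin/python
import re

with open('data.txt', 'r') as fp:
	data = fp.read()

# Each record ends with ';' at the end of a line, so split there.
newData = data.split(';\n')

items = []
for i in newData:
	# Try the four-path pattern first, then fall back to the three-path one.
	line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)\W"\s*.*\s*"(\S*)"', i)
	if not line:
		line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)"', i)
	if not line:
		continue  # empty chunk after the last record, or an unsupported number of paths
	# Group 0 is the label; the remaining groups are the paths.
	items.append(list(line[0][1:]))

print(items)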

It works with your sample data, but I know this can be done better. I'm not sure how to handle a variable number of fields in your input file, so obviously it won't work if a label sometimes has one path, sometimes 10, etc. Every time regular expressions come up I have to learn them all over again. It's always just long enough between uses that I forget it all. lol
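
One way around the variable number of fields might be to let re.findall collect every quoted string in a record, instead of writing one capture group per path. A minimal sketch, assuming records end with ';' followed by a newline as in the sample. The function name parse_records and the inline sample text are made up for illustration:

```python
import re

def parse_records(text):
    """Return {label: [paths]} for records shaped like the sample in this thread."""
    result = {}
    # Each record ends with a ';' that sits outside the quotes, right before a newline.
    for record in text.split(';\n'):
        name, sep, rest = record.partition('=')
        if not sep:
            continue  # leftover chunk after the last record
        # Grab every double-quoted string, however many there are,
        # then drop the ';' separator kept inside the quotes.
        quoted = re.findall(r'"([^"]*)"', rest)
        result[name.strip()] = [q.rstrip(';') for q in quoted]
    return result

# Hypothetical input: one label with three paths, one with a single path.
sample = (
    'Label1 = "c:\\temp1;" +\n'
    '         "c:\\temp2;" +\n'
    '         "c:\\temp3";\n'
    '\n'
    'Label2 = "c:\\label2_1";\n'
)
print(parse_records(sample))
```

Because findall returns a list of whatever it finds, one path or ten makes no difference to the code.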
 

Merad

Platinum Member
May 31, 2010
2,586
19
81
Some people, when they have a problem, think... I know, I'll solve it with regex! And now they have two problems. :)

Edit: As a serious answer to OP's question, if you have control over the data format, use something standard like JSON. If you don't have control over it, go to the person who does, smack them, and tell them to use a standard format like JSON.
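
For comparison, here is roughly what reading the same data would look like if it were stored as JSON. This is only a sketch: the JSON layout below is one assumed way to represent the two labels, not something the existing input file provides.

```python
import json

# Hypothetical JSON version of the sample data. Backslashes must be
# doubled inside JSON strings, just as in Python string literals.
text = r'''
{
    "Label1": ["c:\\temp1", "c:\\temp2", "c:\\temp3", "c:\\temp4"],
    "Label2": ["c:\\label2_1", "c:\\label2_2", "c:\\label2_3"]
}
'''

# json.loads turns the object into a dict of lists, no custom parsing needed.
labels = json.loads(text)
print(labels["Label1"])
print(labels["Label2"])
```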
 

postmark

That regex seriously makes my head hurt just looking at it. I know regex can be powerful, but damn, that is not intuitive to read!
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,039
431
126
I would simply use Perl and override the input line separator. This way when you process the "lines" in the file, instead of getting a single line at a time, you would get all the data for each label, and can then easily write a function to process each label. Like the following:
Code:
#!/usr/bin/perl
use warnings;
my $filename = "data.txt"; #for simplicity, but you can do the whole input arg thing
my %data = (); # new empty hash/associative array
open(FIN, "$filename") or die("Could not open file: $filename : $!");
{
   local $/ = "\n\n"; #assuming that is really just blank line between the data
   while(<FIN>) {
      my $line = $_; #explicitly stating this for ppl who don't know perl
      process_data($line, \%data);
   }
   close(FIN);
}
foreach my $key (sort(keys(%data))) {
   print "$key = [$data{$key}]\n";
}
sub process_data {
   my $chunk = shift; 
   my $data_ptr = shift; #pointer to associative array
   my $label = "";
   my $label_val = "";
   my @tmp = split(/\n/, $chunk);
   foreach my $t (@tmp) {
      if ($t =~ /\S+\s=/) {
         #this is a label line
         ($label, my $t_val) = ($t =~ /(\S+)\s=\s\"([^\"\;]+)/);
         $label_val = "\'$t_val\'";
      }
      elsif ($t =~ /\;/) {
         (my $t_val) = ($t =~ /\"([^\"\;]+)\;?\"/); #the ; is optional so the final "dirname"; line matches too
         $label_val .= ",\'$t_val\'";
      }
   }
   $data_ptr->{$label} = $label_val;
}

Obviously this could be much shorter, but the shorter it gets, the more it assumes that whoever maintains the code knows the tricks of Perl. I also like having a separate function to process each label and its directories, since it is a repeated task that deserves to be split out.

EDIT: fixed a typo
 

Ken g6

This seriously makes my head hurt just looking at it. I know regex can be powerful, but damn that is not intuitive to read!

Yes, that's pretty ugly. I now feel compelled to contribute my own solution, if only to prove regexes don't have to be that ugly. ;)

Code:
#!/usr/bin/python
import re

with open('data.txt', 'r') as fp:
    data = fp.read()

newData = data.split('\n')

items = {}
varname = ""
for line in newData:
    # An assignment such as `Label1 =` starts a new list.
    assign = re.search(r'([a-zA-Z0-9_]+) *=', line)
    if assign and not items.get(assign.group(1), False):
        varname = assign.group(1)
        items[varname] = []

    # A quoted path, with or without a trailing ';' inside the quotes.
    s = re.search(r'"([^";]+);*"', line)
    if s and varname:  # skip any path that appears before the first assignment
        items[varname].append(s.group(1))

print(items)
 

Childs

I knew posting a crappy solution would disgust someone into making a better version. I'm surprised it took so long. :biggrin:

In retrospect I don't know why I wanted to combine each record into one line vs just looking at each line in the text file. At the time that made more sense, but I should point out that the time was like 2AM.
 

Fallen Kell

It doesn't help that the data file has 3 different cases for the data. It actually might have 4, but we do not have an example of a line that has a label and only a single directory. In fact, I highly suspect that there are 4:

1) Label plus a directory, with at least one more directory to follow: label = "dirname;" +
2) Label plus just a single directory: label = "dirname";
3) A second or later directory, with at least one more still to follow: "dirname;" +
4) The final directory: "dirname";

The real pain is that the semicolon sits inside the quotes for every directory except the final one, where it sits outside. If it simply was consistent, there would only be 2 cases.
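
The four cases can probably be told apart with two checks per line: whether the line starts with a "name =" prefix, and whether the statement continues with '+' or ends with ';' after the closing quote. A rough sketch, assuming every path sits on its own line as in the sample. LINE_RE, classify_line and the returned tuple shape are invented for illustration:

```python
import re

# Optional "name =" prefix, then a quoted path with an optional ';' inside
# the quotes, then either '+' (more to come) or ';' (end of statement).
LINE_RE = re.compile(r'^\s*(?:(\w+)\s*=\s*)?"([^";]*)(;?)"\s*([+;])\s*$')

def classify_line(line):
    """Return (label_or_None, path, is_last), or None if the line holds no path."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    label, path, _inner_semicolon, terminator = m.groups()
    return label, path, terminator == ';'

print(classify_line('Label1 = "c:\\temp1;" +'))   # case 1: label, more directories follow
print(classify_line('Label1 = "c:\\temp1";'))     # case 2: label with a single directory
print(classify_line('         "c:\\temp2;" +'))   # case 3: continuation, more follow
print(classify_line('         "c:\\temp4";'))     # case 4: final directory
```

Treating the ';' inside the quotes as optional is what collapses the inconsistency: the parser only has to care about the label prefix and the terminator.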
 