python parsing question

postmark · May 23, 2016

What is the best way to parse a file that has this type of format into lists in python?

Code:

Label1 = "c:\temp1;" +
                     "c:\temp2;" +
                     "c:\temp3;" +
                     "c:\temp4";

Label2 = "c:\label2_1;" +
             "c:\label2_2;" +
             "c:\label2_3";

I would like to return two lists, label1 = ['c:\\temp1','c:\\temp2','c:\\temp3','c:\\temp4'] and label2 = ['c:\\label2_1','c:\\label2_2','c:\\label2_3']

Thanks!

purbeast0 · May 23, 2016

lol this screams homework. at least show us you are even trying to solve this yourself with what you've come up with thus far. but i'd guess an attempt hasn't even been made yet.

postmark · May 23, 2016

purbeast0 said:
lol this screams homework. at least show us you are even trying to solve this yourself with what you've come up with thus far. but i'd guess an attempt hasn't even been made yet.

Not homework, this is for work. My initial thought is to find the labels then do a split on the quotes, but i was having a hard time figuring out where to stop.

Ken g6 · May 23, 2016

I'm slightly confused, since the code you posted isn't valid Python. So you're getting this out of another file?

postmark · May 23, 2016

Ken g6 said:
I'm slightly confused, since the code you posted isn't valid Python. So you're getting this out of another file?

correct. This is an input file that i wanted an easy way to pull all those directories into one directory. So I was going to use a python script to parse this and dump all the data from those directories to one directory.
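Once the directories are in a Python list, the "dump everything into one directory" step could look roughly like the sketch below. The function name gather_files and its arguments are made up for illustration, and it assumes only the top-level files of each directory are wanted and that a later file may overwrite an earlier one with the same name.

```python
import shutil
from pathlib import Path

def gather_files(source_dirs, dest_dir):
    """Copy the top-level files of every source directory into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_dirs:
        for item in Path(src).iterdir():
            if item.is_file():  # subdirectories are skipped (assumption)
                target = dest / item.name
                shutil.copy2(str(item), str(target))  # same-named files get overwritten
                copied.append(target)
    return copied

# Hypothetical usage with the parsed list:
# gather_files(['c:\\temp1', 'c:\\temp2'], 'c:\\combined')
```

If name clashes matter, the target name would need a prefix or counter; that part is left out here.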

Ken g6 · May 23, 2016

re.search seems like a better option than split. One for each kind of thing you want to search for. Each time you find an assignment, create a new list. Each time you find a path, append it to the current list.

postmark · May 23, 2016

purbeast0 said:
lol this screams homework. at least show us you are even trying to solve this yourself with what you've come up with thus far. but i'd guess an attempt hasn't even been made yet.

OK, I realized I only needed 1 list in the end, so I got it working with this, but I was hoping for something more elegant... But this is good for now.

Code:

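paths = []
with open('c:\\temp\\test.txt', 'r') as fin:
    flag = False  # True while we are inside a label's list of paths
    for line in fin:
        # The sample labels are capitalized ("Label1"), so compare case-insensitively
        if line.lstrip().lower().startswith('label'):
            flag = True
        if flag:
            if '"' in line:
                sp = line.strip().split('"')
                if sp[1] != '':
                    paths.append(sp[1].rstrip(';+" ').lstrip())
            # A line ending in ';' closes the current label
            # (the original used '==' here, which never reset the flag)
            if line.rstrip().endswith(';'):
                flag = False

print(paths)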

purbeast0 · May 23, 2016

ive never done python, but this seems like it is easily solvable with regular expressions. does python support regular expressions?

postmark · May 23, 2016

purbeast0 said:
ive never done python, but this seems like it is easily solvable with regular expressions. does python support regular expressions?

It does, but I suck at reg ex

purbeast0 · May 23, 2016

postmark said:
It does, but I suck at reg ex

you aren't going to get better by not trying to learn

i suck at regex too partially because i don't use it often, but when i do i refer to this.

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

Childs · May 26, 2016

I suck at regular expressions but maybe this can get you most of the way:

Code:

#!/usr/bin/python
import re

with open('data.txt', 'r') as fp:
    data = fp.read()

# Each record ends with ';' at the end of a line
newData = data.split(';\n')

items = []
for i in newData:
    # Try the four-path pattern first, then fall back to three paths
    line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)\W"\s*.*\s*"(\S*)"', i)
    if not line:
        line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)"', i)
    if not line:
        continue  # empty trailing chunk, or a record with another path count
    items.append(list(line[0][1:]))

print(items)

It works with your sample data, but I know this can be done better. I'm not sure how to handle a variable number of fields in your input file, so obviously it wouldn't work if there is sometimes one file, sometimes 10, etc. Every time regular expressions come up I have to learn it all over again. It's always just long enough between uses that I forget it all. lol
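One possible way around the variable number of fields is to stop matching a fixed count of paths and instead collect every quoted piece in a record. This is only a sketch under the assumption that each record ends with ';' at the end of a line, as in the sample; parse_records is a hypothetical helper name.

```python
import re

def parse_records(text):
    """Return {label: [paths]} for any number of paths per label."""
    result = {}
    # Split on a ';' that ends a line; a ';' inside the quotes is followed by '"'
    for chunk in re.split(r';\s*\n', text):
        name = re.match(r'\s*(\w+)\s*=', chunk)
        if not name:
            continue  # blank or trailing chunk
        # Grab every quoted piece, then drop the ';' kept inside the quotes
        paths = [p.rstrip(';') for p in re.findall(r'"([^"]*)"', chunk)]
        result[name.group(1)] = paths
    return result

sample = "\n".join([
    r'Label1 = "c:\temp1;" +',
    r'         "c:\temp2;" +',
    r'         "c:\temp3";',
    '',
    r'Label2 = "c:\label2_1";',
])
print(parse_records(sample))
```

Because findall returns however many quoted strings it finds, one directory or ten are handled the same way.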

Merad · May 26, 2016

Some people, when they have a problem, think... I know, I'll solve it with regex! And now they have two problems.

Edit: As a serious answer to OP's question, if you have control over the data format, use something standard like JSON. If you don't have control over it, go to the person who does, smack them, and tell them to use a standard format like JSON.
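To show what that would buy you, here is a small sketch of the same data as JSON, read back with Python's standard json module. The layout (one object with a list per label) is just an assumed example, not anything the OP's tool produces.

```python
import json

# Hypothetical JSON layout for the same data: one list of paths per label
labels = {
    "Label1": [r"c:\temp1", r"c:\temp2", r"c:\temp3", r"c:\temp4"],
    "Label2": [r"c:\label2_1", r"c:\label2_2", r"c:\label2_3"],
}

text = json.dumps(labels, indent=2)  # backslashes are escaped automatically
parsed = json.loads(text)            # no hand-written parsing needed

label1 = parsed["Label1"]
label2 = parsed["Label2"]
print(label1)
print(label2)
```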

postmark · May 29, 2016

Childs said:
I suck at regular expressions but maybe this can get you most of the way:

Code:

#!/usr/bin/python
import re
data = None
with open('data.txt', 'r') as fp:
    data = fp.read()

newData = data.split(';\n')

items = []
for i in newData:
    line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)\W"\s*.*\s*"(\S*)"', i)
    if not line:
        line = re.findall(r'(.*) = "(.*)\W"\s*\W\s*"(.*)\W"\s*\W\s*"(\S*)"', i)
    line = line[0]
    items.append(list(line[1:]))

print items

It works with your sample data, but I know this can be done better. I'm not sure how to handle a variable number of fields in your input file, so obviously it wouldn't work if there is sometimes one file, sometimes 10, etc. Every time regular expressions come up I have to learn it all over again. It's always just long enough between uses that I forget it all. lol

This seriously makes my head hurt just looking at it. I know regex can be powerful, but damn that is not intuitive to read!

Fallen Kell · May 29, 2016

I would simply use Perl and override the input line separator. This way when you process the "lines" in the file, instead of getting a single line at a time, you would get all the data for each label, and can then easily write a function to process each label. Like the following:

Code:

#!/usr/bin/perl
use warnings;
my $filename = "data.txt"; #for simplicity, but you can do the whole input arg thing
my %data = (); # new empty hash/associative array
open(FIN, "$filename") or die("Could not open file: $filename : $!");
{
   local $/ = "\n\n"; #assuming that is really just blank line between the data
   while(<FIN>) {
      my $line = $_; #explicitly stating this for ppl who don't know perl
      process_data($line, \%data);
   }
   close(FIN);
}
foreach my $key (sort(keys(%data))) {
   print "$key = [$data{$key}]\n";
}
sub process_data {
   my $chunk = shift; 
   my $data_ptr = shift; #pointer to associative array
   my $label = "";
   my $label_val = "";
   my @tmp = split(/\n/, $chunk);
   foreach my $t (@tmp) {
      if ($t =~ /\S+\s=/) {
         #this is a label line
         ($label, my $t_val) = ($t =~ /(\S+)\s=\s\"([^\"\;]+)/);
         $label_val = "\'$t_val\'";
      }
      elsif ($t =~ /\;/) {
         #the ; is optional inside the quotes so the final "dirname"; line matches too
         (my $t_val) = ($t =~ /\"([^\"\;]+)\;?\"/);
         $label_val .= ",\'$t_val\'";
      }
   }
   $data_ptr->{$label} = $label_val;
}

Obviously this can be much shorter as well, just that as you get shorter, it makes more assumptions that people maintaining the code know more of the tricks of perl. I also like creating a separate function for processing each label+directory as it is a repeated task worthy of such a breakup of the code.

EDIT: fixed a typo

Ken g6 · May 29, 2016

postmark said:
This seriously makes my head hurt just looking at it. I know regex can be powerful, but damn that is not intuitive to read!

Yes, that's pretty ugly. I now feel compelled to contribute my own solution, if only to prove regexes don't have to be that ugly.

Code:

#!/usr/bin/python
import re

with open('data.txt', 'r') as fp:
    data = fp.read()

newData = data.split('\n')

items = {}
varname = ""
for line in newData:
    # An assignment such as "Label1 =" starts a new list
    assign = re.search(r'([a-zA-Z0-9_]+) *=', line)
    if assign and not items.get(assign.group(1), False):
        varname = assign.group(1)
        items[varname] = []

    # A quoted path, with or without a ';' inside the quotes
    s = re.search(r'"([^";]+);*"', line)
    if s:
        items[varname].append(s.group(1))

print(items)

Childs · May 29, 2016

Ken g6 said:
Yes, that's pretty ugly. I now feel compelled to contribute my own solution, if only to prove regexes don't have to be that ugly.

I knew posting a crappy solution would disgust someone into making a better version. I'm surprised it took so long. :biggrin:

In retrospect I don't know why I wanted to combine each record into one line vs just looking at each line in the text file. At the time that made more sense, but I should point out that the time was like 2AM.

Fallen Kell · May 29, 2016

Childs said:
I knew posting a crappy solution would disgust someone into making a better version. I'm surprised it took so long. :biggrin:

In retrospect I don't know why I wanted to combine each record into one line vs just looking at each line in the text file. At the time that made more sense, but I should point out that the time was like 2AM.

It doesn't help that the data file has 3 different cases for the data. It actually might have 4, but we do not have an example of a line that has a label and only a single directory. In fact, I highly suspect that there are 4:

1) Label plus a directory, with more directories to follow: label = "dirname;" +
2) Label with just a single directory: label = "dirname";
3) Second or later directory, with more directories to follow: "dirname;" +
4) Final directory: "dirname";

The fact that it uses a different location of the quotes and semi-colon depending on if it is the final directory is the real pain. If it simply was consistent, there would only be 2 cases.
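For the Python side, the four cases can probably be folded into a single pattern by making the ';' inside the quotes optional and allowing either '+' or ';' after the closing quote. A minimal sketch, where PATH_RE and extract_path are hypothetical names and case 2 is the guessed single-directory form:

```python
import re

# One pattern for all four line shapes: an optional ';' inside the quotes,
# then an optional '+' or ';' after the closing quote.
PATH_RE = re.compile(r'"([^";]+);?"\s*[+;]?\s*$')

def extract_path(line):
    """Return the directory on this line, or None if there is none."""
    match = PATH_RE.search(line)
    return match.group(1) if match else None

cases = [
    r'Label1 = "c:\temp1;" +',   # 1) label, more directories to follow
    r'Label1 = "c:\temp1";',     # 2) label with a single directory
    r'         "c:\temp2;" +',   # 3) continuation, more to follow
    r'         "c:\temp4";',     # 4) final directory
]
for line in cases:
    print(extract_path(line))
```

Whether a line also starts a new label still has to be checked separately, for example by looking for '=' before the first quote.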
