Converting an HTML file to text

damonpip

If you just need to convert it to plain text, it would be an easy program to write if you know C. You would just need to ignore everything between the < and > and put line breaks in the appropriate places. I only know Java though, and I don't have time to write it anyways.
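
For illustration, a minimal C sketch of the "ignore everything between < and >" idea could look like the following. `strip_naive` is a hypothetical helper name, the caller is assumed to supply an output buffer at least as large as the input, and no line breaks are inserted for block-level tags.

```c
#include <stddef.h>

/* Naive approach: drop every character between '<' and '>'.
 * strip_naive is a hypothetical helper, not a library function.
 * Assumption: out can hold at least strlen(in) + 1 bytes. */
void strip_naive(const char *in, char *out)
{
    int in_tag = 0; /* nonzero while we are between '<' and '>' */

    for (; *in != '\0'; in++) {
        if (*in == '<')
            in_tag = 1;          /* tag opens: stop copying */
        else if (*in == '>' && in_tag)
            in_tag = 0;          /* tag closes: resume copying */
        else if (!in_tag)
            *out++ = *in;        /* ordinary text is copied through */
    }
    *out = '\0';
}
```

Given `<p>Hello, <b>world</b>!</p>` this produces `Hello, world!`. It is fooled by a `>` inside a quoted attribute, though: `<a title="x>y">link</a>` comes out as `y">link`.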
 

Barnaby W. Füi

Sounds simple at first, but HTML is not a simple thing to parse, and I'm sure it has the potential to get really ugly. There are probably tools out there; I would personally use lynx's -dump option.
 

Descartes

It's not quite as easy as simply ignoring everything between '<' and '>', because you can still have an attribute with a quoted '<' or '>' that doesn't actually indicate the opening/closing of a tag. Being the loser I am, I decided to throw something together:

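#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    IN_TAG,
    IN_QUOTE,
    NORMAL
} parseState;

int main(int argc, char **argv)
{
    FILE *inputFile = NULL;
    FILE *outputFile = NULL;
    int c = 0;
    int quoteChar = 0; /* quote character that opened the current attribute value */
    parseState ps = NORMAL;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <input.html> <output.txt>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    inputFile = fopen(argv[1], "r");
    if (inputFile == NULL)
    {
        perror(argv[1]);
        exit(EXIT_FAILURE);
    }

    outputFile = fopen(argv[2], "w");
    if (outputFile == NULL)
    {
        perror(argv[2]);
        fclose(inputFile);
        exit(EXIT_FAILURE);
    }

    /* Test the result of fgetc() rather than feof(), so EOF is never written out. */
    while ((c = fgetc(inputFile)) != EOF)
    {
        switch (ps)
        {
        case IN_TAG:
            if (c == '\'' || c == '"')
            {
                /* Remember which quote opened the value. */
                quoteChar = c;
                ps = IN_QUOTE;
            }
            else if (c == '>')
            {
                ps = NORMAL;
            }
            break;

        case IN_QUOTE:
            /* Only the matching quote closes the value; '<' and '>' are ignored here. */
            if (c == quoteChar)
                ps = IN_TAG;
            break;

        case NORMAL:
            if (c == '<')
                ps = IN_TAG;
            else
                fputc(c, outputFile); /* write to the output file, not stdout */
            break;
        }
    }

    fclose(outputFile);
    fclose(inputFile);

    return 0;
}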

Yeah, it's a really weak state machine, but that's what you get for fifteen minutes of work.
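
As a quick way to check the quoted-'>' case, here is a minimal in-memory sketch of the same three-state idea. `strip_tags` is a hypothetical helper name, the caller is assumed to supply an output buffer at least as large as the input, and comments, scripts and entities are deliberately not handled.

```c
#include <stddef.h>

/* Quote-aware variant: a '>' inside a quoted attribute value does not
 * close the tag. strip_tags is a hypothetical helper, not a library
 * function. Assumption: out can hold at least strlen(in) + 1 bytes. */
void strip_tags(const char *in, char *out)
{
    enum { NORMAL, IN_TAG, IN_QUOTE } state = NORMAL;
    char quote = '\0'; /* which quote character opened the value */

    for (; *in != '\0'; in++) {
        char c = *in;

        switch (state) {
        case NORMAL:
            if (c == '<')
                state = IN_TAG;
            else
                *out++ = c;      /* ordinary text is copied through */
            break;
        case IN_TAG:
            if (c == '"' || c == '\'') {
                quote = c;
                state = IN_QUOTE;
            } else if (c == '>') {
                state = NORMAL;
            }
            break;
        case IN_QUOTE:
            if (c == quote)      /* only the matching quote ends the value */
                state = IN_TAG;
            break;
        }
    }
    *out = '\0';
}
```

With this, `<a title="x>y">link</a>` yields `link`, and `<p class='a"b'>it's</p>` yields `it's`, since the other kind of quote inside a value is ignored.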
 

singh

Originally posted by: Descartes
Yeah, it's a really weak state machine, but that's what you get for fifteen minutes of work.

Just formatting it for you...

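#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    IN_TAG,
    IN_QUOTE,
    NORMAL
} parseState;

int main(int argc, char **argv)
{
    FILE *inputFile = NULL;
    FILE *outputFile = NULL;
    int c = 0;
    int quoteChar = 0; /* quote character that opened the current attribute value */
    parseState ps = NORMAL;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <input.html> <output.txt>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    inputFile = fopen(argv[1], "r");
    if (inputFile == NULL)
    {
        perror(argv[1]);
        exit(EXIT_FAILURE);
    }

    outputFile = fopen(argv[2], "w");
    if (outputFile == NULL)
    {
        perror(argv[2]);
        fclose(inputFile);
        exit(EXIT_FAILURE);
    }

    /* Test the result of fgetc() rather than feof(), so EOF is never written out. */
    while ((c = fgetc(inputFile)) != EOF)
    {
        switch (ps)
        {
        case IN_TAG:
            if (c == '\'' || c == '"')
            {
                /* Remember which quote opened the value. */
                quoteChar = c;
                ps = IN_QUOTE;
            }
            else if (c == '>')
            {
                ps = NORMAL;
            }
            break;

        case IN_QUOTE:
            /* Only the matching quote closes the value; '<' and '>' are ignored here. */
            if (c == quoteChar)
                ps = IN_TAG;
            break;

        case NORMAL:
            if (c == '<')
                ps = IN_TAG;
            else
                fputc(c, outputFile); /* write to the output file, not stdout */
            break;
        }
    }

    fclose(outputFile);
    fclose(inputFile);

    return 0;
}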