Does NFS do anything weird with file encoding?

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
I have a simple program I run that adds comment headers to the top of files, with a checksum and basic stats like lines of code and when the file was last updated. To determine if a file changed, it simply runs a checksum on the part of the file below the header (so it ignores any changes to the header itself). This works fine if I run it locally, but when I run it through NFS, it thinks every single file changed when it has not. When I run it locally again, it then thinks all the files changed again. It seems I can only run it from one place if I want consistent data.

Is NFS actually changing things like how the files are encoded? If so, is there a way to stop that? I want it to present the files exactly as they are stored on disk, not start messing with them.
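
To give an idea of the mechanics, here's a minimal sketch of the "checksum everything below the header" idea. The end-of-header marker and the FNV-1a hash are just stand-ins for illustration, not what my program actually uses:

Code:
//Minimal sketch: hash only what follows the end-of-header marker.
//FNV-1a stands in for my real algorithm; the marker is hypothetical.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

uint64_t Fnv1a(const std::string &s)
{
  uint64_t h = 14695981039346656037ULL; //FNV-1a 64-bit offset basis
  for (unsigned char c : s)
  {
    h ^= c;
    h *= 1099511628211ULL; //FNV-1a 64-bit prime
  }
  return h;
}

uint64_t BodyChecksum(const std::string &path)
{
  std::ifstream in(path.c_str(), std::ios::binary);
  std::stringstream ss;
  ss << in.rdbuf();
  std::string data = ss.str();

  const std::string marker = "*/\n"; //assumed end of the comment header
  size_t pos = data.find(marker);
  std::string body = (pos == std::string::npos) ? data : data.substr(pos + marker.size());
  return Fnv1a(body);
}

int main(int argc, char **argv)
{
  if (argc > 1)
    std::cout << std::hex << BodyChecksum(argv[1]) << std::endl;
  return 0;
}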
 

Scarpozzi

Lifer
Jun 13, 2000
26,391
1,780
126
Is NFS just changing the file access timestamp? ....like when the process runs to check the file? See what extended attributes you can read on the file and see what changes through NFS when the file is opened.
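
Here's a quick sketch of the timestamp check using stat() - these are just the standard POSIX fields, nothing specific to your setup. Run it against the same file on the local mount and the NFS mount and compare:

Code:
//Dump a file's timestamps so the local mount and the NFS mount
//can be compared directly (standard POSIX stat() fields).
#include <sys/stat.h>
#include <cstdio>
#include <ctime>

int main(int argc, char **argv)
{
  if (argc < 2)
  {
    std::fprintf(stderr, "usage: %s <file>\n", argv[0]);
    return 1;
  }

  struct stat st;
  if (stat(argv[1], &st) != 0)
  {
    std::perror("stat");
    return 1;
  }

  std::printf("atime: %s", std::ctime(&st.st_atime)); //last access
  std::printf("mtime: %s", std::ctime(&st.st_mtime)); //last data change
  std::printf("ctime: %s", std::ctime(&st.st_ctime)); //last metadata change
  return 0;
}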
 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
I was thinking that too, but my program strictly uses a checksum. In fact it ignores newlines and even code comments, so I'm starting to think I have some other weird issue going on. It just seems odd that it makes a difference which system I run it on; the file should be exactly the same regardless.
 

Scarpozzi

Lifer
Jun 13, 2000
26,391
1,780
126
If you copy a file....test locally and via NFS....then diff the two. See if that points you in the right direction. Consider your base file system too...there may be differences if you're using one of these fancy journaling file systems. (I was real big into ext2)

Otherwise, find out what NFS version you're on and see if there are any reported bugs with it. In recent years, I've been hit with a string of issues.
 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
Yeah, this is weird. I think it depends on the machine I run the app on, not on how the files are accessed. Linux binaries don't tend to be portable across systems, so I wonder if the program behaves differently when run locally vs on the server (where it was compiled). I'll have to play with this further at some point; the whole idea of this program is to tag files only if they changed, and it kind of ruins it if the simple act of running it on another system makes it think everything changed. But I think I did rule out NFS as the issue.
 

you2

Diamond Member
Apr 2, 2002
6,704
1,731
136
The md5 should be exactly the same whether locally mounted or remotely mounted. Try md5sum [filename] on both the local and remote machine. I suspect something is wrong with your program.
-
Now there is always a chance that data is being corrupted if your local or remote machine has a bad interface or bad memory. To be honest, I have no clue what you are trying to do with your program - I've always just used the command-line tools myself (btw, md5sum is pretty weak these days from a security perspective; sha256 should be used).

 

you2

Diamond Member
Apr 2, 2002
6,704
1,731
136
I don't quite grasp what you are implying about the file system having an impact on data content. Yes, the raw bits on disk might be different, but he isn't reading the raw disk, he is reading the user data. User data obtained via file system calls should be identical whether it be zfs, ext4, ext2, btrfs, xfs, ... I.e., cat [path]/file will produce the same output unless his system is fundamentally broken (bad memory, bad disk, bad sata card, bad ....)

 

Scarpozzi

Lifer
Jun 13, 2000
26,391
1,780
126
You are correct. In a perfect world, you won't have any issues with data content being messed with, but as I'm sure you know, filesystems are just low-level software. It's not too common for low-level bugs to show up between versions because they're fixed pretty quickly, but it does happen.

NFS is pretty stable, but there have been some bugs over the past few years, as I mentioned above. I was just trying to suggest that it may be related to a bug. Correct, hardware could be to blame as well; I just try not to blame the hardware layer first, because that's not as easily remedied as making sure your system is patched...
 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
The reason I was asking is that I've run into weird encoding issues before: open a file, a TEXT file, and it's somehow all Chinese characters. So something was changing my files, but I'm not sure what; I was starting to wonder if maybe it was NFS, either on the fly or writing it weirdly to disk, or something. Though I have not run into that issue in a while; it may have been a particular editor or program that was doing it. Encoding is a weird thing, I won't pretend to understand it.

I have not looked into this further, but assuming NFS does not do anything weird, it must be something off with my program. I just don't quite get what, because it's such a simple program. This program does other stuff, not just the checksum check; it also adds a header to the source files so I can track the last update, etc. The header part is not accounted for in the checksum, as the checksum only looks at code, not comments. I think it may have to do with the fact that Linux binaries aren't very portable, so on one system it might act differently than on another. I'd have to recompile it on each system and see if that helps. I have not played around with it much, as I tend to just run it on the dev server and all is fine. I had just happened to run it locally and noticed it was marking every single file as changed, when that was not actually the case.
 

you2

Diamond Member
Apr 2, 2002
6,704
1,731
136
That's totally different. The raw bits being read from disk are the same; the software is displaying the bits differently due to the encoding scheme you've chosen to represent the textual data, but this is not related to how the data is stored on disk or what cat (for example) reads off the disk. You can do the same thing with a website by telling your browser to treat the encoding differently.
-
This is way above the filesystem or system call (read/write) layer.
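
You can see the effect with a toy example - the exact same bytes come out as CJK characters if you decode them as UTF-16 instead of ASCII. This is just an illustration; it says nothing about your program:

Code:
//Toy example: the same bytes "turn Chinese" when decoded as UTF-16
//instead of ASCII.  The bytes on disk never change.
#include <cstdio>
#include <string>

int main()
{
  std::string bytes = "Bush hid the facts"; //18 plain ASCII bytes

  std::printf("as ASCII:  %s\n", bytes.c_str());

  //Decoded as UTF-16LE, each byte pair becomes one code unit, and
  //these pairs all land in the CJK ideograph range:
  std::printf("as UTF-16LE:");
  for (size_t i = 0; i + 1 < bytes.size(); i += 2)
  {
    unsigned cp = (unsigned char)bytes[i] | ((unsigned)(unsigned char)bytes[i + 1] << 8);
    std::printf(" U+%04X", cp);
  }
  std::printf("\n");
  return 0;
}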
-
As for bugs in filesystems - yes, they exist, but it has been many, many years since the basic read/write calls have had issues on established filesystems (ext4, zfs, btrfs, ....). I've certainly not seen any NFS issues in the past two decades beyond some advanced fault-tolerance issues (I've been using NFS for over 30 years).
-
I have to believe the issue is either faulty hardware or a bug in your program - if your program is less than 100 lines, why not post it in this thread? Frequently people use the read system call incorrectly under the presumption that the underlying file system is local and synchronous; a loop that tolerates short reads looks something like the sketch below.
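
A sketch using plain POSIX open()/read(); the helper name ReadAll is made up for illustration:

Code:
//Sketch: read a whole file with POSIX read(), tolerating short reads
//and EINTR.  A single read() may legally return fewer bytes than asked.
#include <unistd.h>
#include <fcntl.h>
#include <cerrno>
#include <cstdio>
#include <string>

bool ReadAll(const char *path, std::string &out)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;

  char buf[4096];
  for (;;)
  {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) { out.append(buf, n); continue; } //short read: just keep going
    if (n == 0) break;                           //EOF
    if (errno == EINTR) continue;                //interrupted: retry
    close(fd);
    return false;                                //real I/O error
  }
  close(fd);
  return true;
}

int main(int argc, char **argv)
{
  std::string data;
  if (argc > 1 && ReadAll(argv[1], data))
    std::printf("read %zu bytes\n", data.size());
  return 0;
}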

 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
Finally got around to playing further with this. I always run this app locally on the dev server, but it still bothers me that the results change if I run it on another PC; it just makes no sense to me. So I still want to fix it.

It does seem to be related to my algorithm somehow, though. The checksum actually DOES change, but only very slightly, and always in the same spot. It's really messed up, but at least it gives me a starting point. I added some debug info to the output:

Code:
Checksum changed.  
Old: [25311EC24F66ACD31F2D5F9AAA5D8C89]
new: [25311EC24F66ACD3262D5F9AAA5D8C89]
Updating testinc2/folder/asubfile.cpp...  Done.
Checksum changed.  
Old: [2919B210D431CA255FF69EA9E0E1A73F]
new: [2919B210D431CA2566F69EA9E0E1A73F]
Updating testinc2/bigfile1.cpp...  Done.
Checksum changed.  
Old: [A175B506FC7FD14124CBC198591E875]
new: [A175B506FC7FD14194CBC198591E875]
Updating testinc2/simplefile.cpp...  Done.
Checksum changed.  
Old: [0ED37E63AFEE6B0325FC6B58A5AEF]
new: [0ED37E63AFEE6B0395FC6B58A5AEF]
Updating testinc2/bigfile2.cpp...  Done.

I'll probably end up just redoing the entire algorithm and hoping the problem goes away. It does not really matter or need to be secure; the idea is just to see if a source file is different, while ignoring things like comments or new lines. It does not seem to be NFS related either; if I copy the files over to my local PC and run it there, it also thinks they're different.
 

you2

Diamond Member
Apr 2, 2002
6,704
1,731
136
To be honest, that looks more like a programming error - are you using the md5sum program? The thing is that if even one bit changed in the file, the md5 would be radically different (in most cases). If you had parity errors, I would expect much more radical changes in the md5 computation, because NFS data would be subject to parity errors as well as your program. Linux comes with sha256sum and md5sum programs.
 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
Yeah, it is a program error, but at first I was not sure, so I ruled out NFS. In fact I replaced the file in the program with a text string and it still does it: if I run it on one computer I get a different result than on another. It makes no sense, but I'm going to break down the algorithm step by step and try to figure out where it goes wrong. I didn't want to have to use libraries, as I wanted the program to be portable, so I wrote my own algorithm; the issue lies in there somewhere. Just not sure why it would differ between computers.

This is the code if curious:

I'm probably going to end up just rewriting the algorithm to see if the problem goes away.

Code:
//generates a checksum for the file.  This is a very basic custom algorithm that ignores stuff like carriage returns and anything that is considered a C++ comment 
string GetCheckSum(string data) 
{ 
   
  string mfile; //modified local file 
   
 
   
  for(int i=0;i<data.size();i++) 
  { 
     
    if(i+1<data.size() && data[i]=='/' && data[i+1]=='/') //1 line comment, ignore everything till new line 
    { 
      for(;i<data.size();i++) //loop till we find a \n or \r 
      { 
        if(data[i]=='\n' || data[i]=='\r')break; 
      } 
      continue;       
    } 
     
    if(i+1<data.size() && data[i]=='/' && data[i+1]=='*') //multi line comment, ignore everything till comment ending 
    { 
      for(;i<data.size();i++) //loop till we find end of comment 
      { 
        if(i+1<data.size() && data[i]=='*' && data[i+1]=='/'){i++;break;} //step past the closing */ 
      } 
      continue; 
    } 
     
     
    //some chars we just want to ignore completely: 
     
    if((unsigned char)data[i]>=127)continue; //extended ascii (cast: plain char may be signed, making high bytes negative) 
    if((unsigned char)data[i]<=31)continue; //some control chars including carriage returns 
     
     
    mfile.append(1,data[i]); 
  } 
   
    
  //debug:   
  //cout<<"Org file:"<<endl<<data<<endl<<endl<<"New file: "<<endl<<mfile<<endl<<"size:"<<mfile.size()<<endl<<endl; 
   
   
   
  /*now that we have a clean data without comments and stuff, we do some misc computations to come up with a checksum.  
  These are pretty much winged, and it really does not matter what is done, the idea is to just come up with a somewhat unique 
  string that will change if even a minor change is done to the file. */ 
   
  const unsigned int CS_Size=16; //how big in bytes the checksum is (this is converted to hex string after)  
  const unsigned int CS_SizeD=CS_Size*2;    //double CS_Size (to avoid computing multiplication each time) 
   
   
  string workdata=mfile; 
   
   
  workdata="test";  //debug
  
   
  //if file is empty we add one character (this will ensure next step works) 
  if(workdata.size()<1)workdata.append(1,' '); //the actual data does not really matter so we'll just put a space 
   
   
  //Ensure file is at least the same size as (CS_SizeD), if not we add padding: 
   
  while(workdata.size()<CS_SizeD) 
  { 
    workdata.append(1,' '); //add padding 
  } 
   
  //Add some padding to the file so that it has blocks in a multiple of CS_SizeD: 
   
  int diff = CS_SizeD-(workdata.size()%CS_SizeD);  
  if(diff==CS_SizeD)diff=0;  
   
  cout<<"Dif: "<<diff<<endl;  //debug 
   
  for(int i=0;i<diff;i++) 
  { 
    workdata.append(1,workdata[i]);  //add padding 
  } 
   
  /* Note on padding: the actual padding we add really does not matter, as we're not really trying to do anything random at this point 
  the first round is spaces, second round is just repeating the workdata.  The only reason for difference is to make debugging a bit easier 
  technically if the first round runs, the second won't be required as the size will be well rounded. */ 
   
  cout<<HexDump(workdata)<<endl<<"size:"<<workdata.size()<<endl; //debug 
   
  /*here is where the fun begins.  We basically have at least two "rows" of CS_Size.     
  We'll just do some pseudo random manipulation of the data to combine them, then keep combining till we get a string 
  that is CS_Size, which is the hash.  */ 
   
  string checksum=workdata.substr(0,CS_Size);   //initiate checksum with the first row 
     
  string row1=""; 
  string row2=""; 
  unsigned int tmp1=0; 
  unsigned int tmp2=0; 
  unsigned int tmp3=0; 
   
  //some misc vars used for pseudo random generation: 
  unsigned int pr1=0;     
  unsigned int pr2=0;     
   
   
   
  do 
  { 
    //cout<<"workdata size:"<<(workdata.size())<<endl;  //debug 
    //get data to form 2 rows: 
     
      row1=workdata.substr(0,CS_Size);   
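      //BUG (found a couple posts down): on the last pass only one row remains, so row2 comes back empty and row2[i] below reads past the end 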
      row2=workdata.substr(CS_Size,CS_Size); 
 
      //delete only one row:       
      workdata.erase(0,CS_Size);  
       
      //cout<<"row1:  "<<HexDump(row1)<<endl<<"row2:  "<<HexDump(row2)<<endl;  //debug 
     
    //combine the two rows with some pseudo random basic math and mixing: 
     
    for(int i=0;i<CS_Size;i++) 
    { 
      int i2=CS_Size-i; 
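      //note: when i==0, i2==CS_Size, so row1[i2] below touches index CS_Size (the terminating '\0' in C++11; out of range before that) 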
       
      tmp1=(unsigned char)checksum[i]; 
      tmp2=(unsigned char)row1[i]; 
      tmp3=(unsigned char)row2[i];   
       
      pr1+=(unsigned char)row1[i2]; 
       
      tmp1 = (tmp1+1) * (tmp2 ^ pr1)/(i2+1) + i + tmp3/3; 
       
       
      //cout<<"     [i: "<<i<<" i2: "<<i2<<"   tmp1: "<<tmp1<<" tmp2: "<<tmp2<<" tmp3: "<<tmp3<<" pr1: "<<pr1<<"]    "; //debug 
       
      checksum[i]=tmp1%256;  
    } 
     
    pr1 = pr1/CS_Size; 
     
     
    //cout<<endl<<"checksum: "<<HexDump(checksum)<<endl<<"size: "<<(workdata.size())<<endl; //debug 
     
 
  }while(workdata.size()>=CS_Size); 
   
 
  //cout<<endl<<endl<<"Final Checksum: "<<HexDump(checksum)<<endl<<endl;//debug 
   
  return BitStream::Str2HexStr(checksum,0,false); 
}
 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
Ok figured it out!

Basically, when it gets to the end of the file, the string row2 will be empty, but the code still tries to work with it, so it was reading past the end of the array. A buffer overflow, basically. Though I would have thought the string library would handle that by defaulting the value to 0 or something.

I don't quite get why I was getting different results on different machines, but what counts is that I fixed it. It still bothers me though; it really should not have been doing that.

So yeah, this turned out to be a coding error after all. Not sure how I missed it when I first wrote this app; trying to read past the end of an array is an amateur mistake.
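
(Reading past the end just picks up whatever bytes happen to be sitting in memory there, which would explain why different machines gave different checksums.) For anyone curious, the shape of the fix is something like this - same names as the code above, but a simplified sketch rather than my exact change:

Code:
//Sketch of the corrected loop shape: consume one row per pass, but
//only continue while TWO full rows remain, so row2 is never empty.
#include <iostream>
#include <string>

int main()
{
  const size_t CS_Size = 16;
  std::string workdata(CS_Size * 4, 'x'); //already padded to a multiple of 2*CS_Size

  std::string row1, row2;
  do
  {
    row1 = workdata.substr(0, CS_Size);
    row2 = workdata.substr(CS_Size, CS_Size); //never empty with the fixed condition below
    workdata.erase(0, CS_Size);
    //...mix row1/row2 into the checksum as before...
  } while (workdata.size() >= 2 * CS_Size); //was: >= CS_Size, which let row2 come up empty

  //Side note: std::string::operator[] is unchecked -- indexing past the
  //end is undefined behaviour, not a guaranteed 0.  string::at() is the
  //checked version; it throws std::out_of_range instead.
  std::cout << workdata.size() << " bytes remain (already mixed in as row2)" << std::endl;
  return 0;
}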
 

you2

Diamond Member
Apr 2, 2002
6,704
1,731
136
Any reason you don't just use md5sum? Or, if removing whitespace is a must, something like:
cat file | tr -d "\t\n\r" | md5sum

 

Red Squirrel

No Lifer
May 24, 2003
70,087
13,536
126
www.anyf.ca
I would need to find a way to pipe the data to the command and get the result back from within the program; it would add a whole extra layer of complexity, and it would depend on the command being available on each particular system. This program is meant to be cross-platform, as I use it for various projects. It's just a simple app to tag source files with a common header with basic info, and the checksum is just used to see if the header should be redone. So I wanted to keep it simple. Example header:

Code:
/**********************************************************
(This info generated by cppinventory v1.3.1)
test app 2 C++ source file
GPL license, etc
Last modified by Red Squirrel on Jul-10-2019 05:13:59pm
Checksum: 4A79DDBA3C4822DFAD495C1C7AD9740
Filepath: testinc2/folder/asubfile.cpp
Lines of code: 2

Description: 

***********************************************************/