I would call it "transcoding" rather than "encoding" (though since "transcoding" isn't a widely used word, of course nobody else does), because most of these operations require both a decoding and an encoding step. Basically, this is what happens: your PC decodes the original video stream and renders each frame into a raw format (renders it to memory, not necessarily to the screen). The program then encodes that raw image data into its own format.

The encoding part is very intensive because the raw images have to be squeezed into an extremely narrow data stream (storing 120 minutes of 30 fps video as independent 100 KB JPEG frames would take around 21.5 GB for the video alone, and playback would be pretty CPU-intensive, too), which is why the encoder mostly stores each frame as a set of changes to the frame before it, reshaping that previous frame to match the new image as closely as possible.
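
As a quick sanity check on that figure, here is a minimal Python sketch of the arithmetic, assuming roughly 100 KB per JPEG frame as in the example above (the numbers are just the post's ballpark assumptions, not measurements):

```python
# Back-of-envelope: size of 120 minutes of 30 fps video stored as
# independent ~100 KB JPEG frames, with no inter-frame compression.
minutes = 120
fps = 30
frame_kb = 100                              # assumed ~100 KB per JPEG frame

frames = minutes * 60 * fps                 # 216,000 frames total
total_gb = frames * frame_kb * 1000 / 1e9   # KB -> bytes -> GB (decimal)
print(f"{frames} frames -> ~{total_gb:.1f} GB")  # ~21.6 GB, matching the ~21.5 GB ballpark
```

That is exactly the cost real codecs avoid by predicting each frame from the previous one and only storing the differences.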