Hello, folks!
I'm back with an overview of PE files. I don't think it's productive to write an extensive technical post about them, since there is already a lot of information on the web. I linked some very good references at the end of the post.
So the idea here is just to give you a general picture of the PE file. From there you can do your own research, using the references as a starting point. I find it easier to learn when you grasp the general idea first and then dig deeper into each topic.
Everything I wrote here is in the references, so you can use them for more in-depth reading. If you have any questions, feel free to ask me; I will be glad to help if I can.
Tools used in this post:
PE File Format
Let's start thinking about the PE file with a rough analogy. A PE file is a kind of recipe that carries its own ingredients: it holds all the steps needed to execute the recipe, and every ingredient is either included in the file or referenced by it.
This format is inherited from COFF (Common Object File Format), which was used on VAX/VMS, since many of the original Windows NT developers came from Digital Equipment Corporation, where the COFF file format was in use. These formats serve as the basis for loaders to read an executable on the system. So, to move to Windows NT quickly, the developers kept the original format and enhanced it into PE (Portable Executable).
The PE format is the standard format Windows uses to execute programs, i.e. a way of organizing the data inside a file that makes it possible for all flavors of Windows to read it, load it and execute it. There is almost no difference between 32-bit and 64-bit PE files; the differences are mostly in the size of some fields.
DLL and EXE files use the same PE format and differ only in the values of a few fields, mainly in "File Header->Characteristics". What applies to DLLs basically also applies to OCX and even CPL files. Once you know the structure of the PE file, you know how the executable is laid out in memory when it's executed, because the loader uses that structure to decide which parts of the file on disk will be mapped into memory. Let's see a little overview in the picture below:
PE File - File on Disk and In Memory
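Going back to the File Header->Characteristics field mentioned above, here is a minimal sketch of how the EXE/DLL distinction can be read from it. The two flag values come from the PE specification; the function name describe_image and the sample values are my own, chosen only for illustration.

```python
# Flag values as defined in the PE specification (winnt.h).
IMAGE_FILE_EXECUTABLE_IMAGE = 0x0002  # the image is valid and can be run
IMAGE_FILE_DLL = 0x2000               # the image is a DLL


def describe_image(characteristics: int) -> str:
    """Classify a PE file from its File Header->Characteristics value."""
    if not characteristics & IMAGE_FILE_EXECUTABLE_IMAGE:
        return "not an executable image"
    return "DLL" if characteristics & IMAGE_FILE_DLL else "EXE"


# Sample values chosen for illustration, not read from a real file.
print(describe_image(0x0102))
print(describe_image(0x2102))
```

The second sample value is the first one with the IMAGE_FILE_DLL bit added, which is the only thing the function looks at to tell the two apart.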
All the information the loader needs to map the file into memory is in the PE file itself. The information needed to translate an offset in the file on disk to the corresponding address in the mapped image is also available in the file. Next is an SVG image from Wikipedia that gives a nice view of the PE:
When a PE file is loaded into memory it is known as a module, and so are all the other PE files it imports. The address where a module begins is known as its HMODULE, as it is called in Microsoft's API. Unlike a file on disk, in memory we have the concept of virtual memory, i.e. we don't access the real physical memory of our computer: the OS creates a virtual memory space in which to allocate everything and translates the virtual addresses to real physical memory. This gives the OS better control over memory management and security. Regions of the virtual memory space are protected by the Windows Memory Manager (a Windows component) as read-only, read/write or execute, according to what is specified in the section headers of the PE file.
MS-DOS Header
It came in handy in the first versions of Windows, when Windows machines weren't as common as they are nowadays: if the executable was started under MS-DOS, it could at least print a message saying that it needs Windows to run. This header, and therefore every executable, always starts with the e_magic field, which holds IMAGE_DOS_SIGNATURE; it's important to remember this. The most important field here is e_lfanew, which holds the offset to the NT Header, where all the useful information resides.
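To make the role of e_magic and e_lfanew more concrete, here is a small sketch that follows them to the NT Header. It assumes the standard layout (e_lfanew at offset 0x3C, and the NT Header starting with the "PE\0\0" signature); the function name find_nt_header and the fake buffer are made up for the example, so treat it as an illustration and not as a full parser.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D  # "MZ" read as a little-endian 16-bit value
E_LFANEW_OFFSET = 0x3C        # position of e_lfanew inside the DOS header


def find_nt_header(data: bytes) -> int:
    """Return the file offset of the NT Header, or raise ValueError."""
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != IMAGE_DOS_SIGNATURE:
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return e_lfanew


# Synthetic buffer: a DOS header whose e_lfanew points to offset 0x80,
# where the "PE\0\0" signature is placed. Not a real executable.
fake = bytearray(0x84)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, E_LFANEW_OFFSET, 0x80)
fake[0x80:0x84] = b"PE\0\0"

nt_offset = find_nt_header(bytes(fake))
print(hex(nt_offset))
```

With a real file you would pass the bytes read from disk instead of the fake buffer.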
PE Sections
PE file sections are used to split up the data in the file. Some sections hold code and others hold data. There are several kinds of data, such as space to read and write information, API imports, function exports, resources and so on. Every section in the PE file specifies what is in it. Broadly, a PE file has two types of section: code and data.
The Windows Loader uses the information in the section header to properly load the section into memory. Each section has its own attributes, such as which type of data it holds and whether it is read-only or read/write in memory, all specified in the Characteristics field of the Section Header. A section can also be shared between processes if its header says so.
Section names are just a way to better identify what is inside. The operating system doesn't care about the name itself, only about the Characteristics field, which indicates the type of the section.
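As an illustration of how the section Characteristics field carries this information, the sketch below decodes a few of its flags. The flag values are the ones defined in the PE specification; the helper decode_section_flags, the short labels and the sample values are mine.

```python
# A subset of the section flags defined in the PE specification.
SECTION_FLAGS = [
    (0x00000020, "code"),                # IMAGE_SCN_CNT_CODE
    (0x00000040, "initialized data"),    # IMAGE_SCN_CNT_INITIALIZED_DATA
    (0x00000080, "uninitialized data"),  # IMAGE_SCN_CNT_UNINITIALIZED_DATA
    (0x10000000, "shared"),              # IMAGE_SCN_MEM_SHARED
    (0x20000000, "execute"),             # IMAGE_SCN_MEM_EXECUTE
    (0x40000000, "read"),                # IMAGE_SCN_MEM_READ
    (0x80000000, "write"),               # IMAGE_SCN_MEM_WRITE
]


def decode_section_flags(characteristics: int) -> list:
    """Return the labels of the known flags set in a Characteristics value."""
    return [label for flag, label in SECTION_FLAGS if characteristics & flag]


# Values commonly seen for a code section and a writable data section.
print(decode_section_flags(0x60000020))
print(decode_section_flags(0xC0000040))
```

Notice that the section name never enters the picture: two sections with the same flags are treated the same way, whatever they are called.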
It's important to remember that, since the operating system uses virtual memory protection, the alignment of sections in memory (Optional Header->SectionAlignment) is different from their alignment in the file on disk (Optional Header->FileAlignment). On disk the default alignment is 200h (hex), so section offsets look like 200h, 400h, 600h and so on. In memory the loader maps the sections so that each one starts at the beginning of a memory page, and the page takes the protection (read-only, read/write, execute) specified in the section header. Both 32-bit (x86) and 64-bit (x64) Windows use 4 KB pages (8 KB pages were used on Itanium), so SectionAlignment is typically 1000h. In any case, you can always check the actual values in the Optional Header->SectionAlignment and Optional Header->FileAlignment fields.
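A quick way to see the effect of the two alignments is to round the same section size up to each of them. This is only a sketch: align_up is a helper name I made up, the 0x1234 size is arbitrary, and 0x200 / 0x1000 are assumed as the FileAlignment and SectionAlignment values.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


FILE_ALIGNMENT = 0x200      # assumed Optional Header->FileAlignment
SECTION_ALIGNMENT = 0x1000  # assumed Optional Header->SectionAlignment

section_size = 0x1234  # arbitrary size of some section's contents

size_on_disk = align_up(section_size, FILE_ALIGNMENT)
size_in_memory = align_up(section_size, SECTION_ALIGNMENT)
print(hex(size_on_disk), hex(size_in_memory))  # 0x1400 0x2000
```

The same content occupies 0x1400 bytes in the file but 0x2000 bytes of address space once mapped, which is why offsets on disk and addresses in memory drift apart.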
Relative virtual address (RVA)
The RVA is an important piece of the PE file. It's used to locate objects after the file is loaded into memory. When a PE file is loaded into memory it starts at a certain address that we call the ImageBase (this address will be the HMODULE); the preferred value is stored in OptionalHeader->ImageBase. To make this concrete, let's see in CFF Explorer (a PE viewer) how these values are expressed:
PE File - Section Header / Virtual Address
So, to locate the .text section in memory we use the VirtualAddress field, which is the Relative Virtual Address of the section in memory. It's relative because the final address of the section depends on the ImageBase address. To locate any section in memory we compute ImageBase + RVA. If the ImageBase is 0x400000 and the RVA of the .text section is 0x1000, the final address is 0x401000, which is where the .text section starts in memory.
So in the header we have both the RVA (VirtualAddress) and the offset on disk (RawAddress). If the PE file is not mapped into memory we use only the Raw Address; if the PE file is loaded we use only ImageBase + VirtualAddress.
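The two ways of addressing can be sketched in a few lines. The first function is just the ImageBase + RVA sum from the example above; the second translates an RVA back to an offset in the file by finding the section that contains it. The Section tuple, the function names and the section values are assumptions made for the example, not values taken from the screenshot.

```python
from typing import List, NamedTuple, Optional


class Section(NamedTuple):
    name: str
    virtual_address: int  # RVA of the section once loaded
    raw_address: int      # offset of the section in the file on disk
    raw_size: int         # number of bytes the section occupies on disk


def rva_to_va(image_base: int, rva: int) -> int:
    """Address in memory of something located at the given RVA."""
    return image_base + rva


def rva_to_offset(sections: List[Section], rva: int) -> Optional[int]:
    """File offset for an RVA, or None if no section on disk backs it."""
    for section in sections:
        start = section.virtual_address
        if start <= rva < start + section.raw_size:
            return rva - start + section.raw_address
    return None


# Made-up section table for illustration.
sections = [
    Section(".text", 0x1000, 0x400, 0x1E00),
    Section(".data", 0x3000, 0x2200, 0x200),
]

print(hex(rva_to_va(0x400000, 0x1000)))      # 0x401000
print(hex(rva_to_offset(sections, 0x1010)))  # 0x410
```

This is the translation a PE viewer does for you when it shows a Raw Address next to a Virtual Address.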
Data Directories
Data directories are data structures used to keep information that the PE file needs. For example, the import section has a data structure that contains all the information the Windows Loader needs when loading the PE file into memory, so it can load the imports before it starts executing the code.
Examples of data directories are imports, exports and resources. In the PE file we have a header to locate each of these structures. The next image shows the header of the Data Directories; here you have the RVA of each directory in memory and its size.
PE File - Data Directories
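The table in the image is simply an array of (RVA, Size) pairs, 8 bytes each, in a fixed order defined by the PE specification. Below is a hedged sketch of reading it; parse_data_directories is a name I made up, and the RVAs and sizes in the sample buffer are invented rather than taken from a real file.

```python
import struct

# Order of the 16 data directory entries according to the PE specification.
DIRECTORY_NAMES = [
    "Export", "Import", "Resource", "Exception",
    "Security", "Base Relocation", "Debug", "Architecture",
    "Global Pointer", "TLS", "Load Config", "Bound Import",
    "IAT", "Delay Import", "CLR Runtime Header", "Reserved",
]


def parse_data_directories(raw: bytes) -> dict:
    """Map directory name -> (RVA, size) for every non-empty entry."""
    entries = {}
    pairs = struct.iter_unpack("<II", raw[:8 * len(DIRECTORY_NAMES)])
    for name, (rva, size) in zip(DIRECTORY_NAMES, pairs):
        if rva or size:
            entries[name] = (rva, size)
    return entries


# Synthetic table with three entries filled in (made-up values).
raw = bytearray(8 * 16)
struct.pack_into("<II", raw, 0 * 8, 0x6000, 0x50)   # Export
struct.pack_into("<II", raw, 1 * 8, 0x7000, 0x3C)   # Import
struct.pack_into("<II", raw, 2 * 8, 0x9000, 0x4E8)  # Resource

directories = parse_data_directories(bytes(raw))
for name, (rva, size) in directories.items():
    print(f"{name}: RVA={rva:#x} size={size:#x}")
```

In a real file this array sits at the end of the Optional Header, so you would slice it out of the file before calling the function.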
Importing Functions
In the PE file we have a directory containing all the information about the imported functions: which functions come from which DLLs. With it, the Loader can load and locate all the symbols it needs to run the module.
When you use imported functions from DLLs, the compiler and linker automatically generate a PE file whose import section specifies which DLLs are used by the file, so the Windows Loader can properly load each DLL and prepare it to be used by the file at run-time. Note that I didn't import all these functions in my source code, but the compiler did. I used GCC and, as you can see, it imported a lot of functions for internal purposes like security and so on.
PE File - Import Directory
The PE file keeps an array of data structures, one for each imported DLL. Each data structure points to two arrays known as the Import Address Table (IAT) and the Import Name Table (INT). In the previous image we can see that each of these data structures has the name of the DLL (ModuleName) along with the two arrays: OFT (OriginalFirstThunk / INT) and FT (FirstThunk / IAT).
The tricky part here is that both arrays have the "same" structure, because each element is a union that can hold any of the values defined in the structure. I recommend reading the references to get a more comprehensive understanding of it; take your time reading and exploring the PE file. Despite the tricky part, in general the FirstThunk field points to the IAT array, which is overwritten by the Windows Loader with the addresses of all the API functions, and OriginalFirstThunk points to the INT, whose entries lead to a structure with two fields, Hint and Name. The Name field is the name of the imported function, and Hint is a guess at the index that name might have in the export name table of the DLL.
PE File - Import Descriptor
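To give an idea of what "a union that could be any of the values" means in practice, here is a sketch for the 32-bit case. It assumes the rule from the PE specification that the highest bit of a thunk marks an import by ordinal, and that otherwise the value is an RVA to a Hint/Name pair. The function names and the sample bytes are made up for the example.

```python
import struct

IMAGE_ORDINAL_FLAG32 = 0x80000000  # highest bit set: import by ordinal


def decode_thunk32(thunk: int) -> tuple:
    """Interpret one 32-bit entry of the INT (OriginalFirstThunk array)."""
    if thunk & IMAGE_ORDINAL_FLAG32:
        return ("ordinal", thunk & 0xFFFF)
    return ("name_rva", thunk & 0x7FFFFFFF)


def read_import_by_name(data: bytes, offset: int) -> tuple:
    """Read a Hint (16 bits) followed by a NUL-terminated ASCII name."""
    (hint,) = struct.unpack_from("<H", data, offset)
    end = data.index(b"\0", offset + 2)
    return hint, data[offset + 2:end].decode("ascii")


# Made-up samples: one thunk by ordinal, one by name, and a Hint/Name blob.
print(decode_thunk32(0x80000010))
print(decode_thunk32(0x00002345))

blob = struct.pack("<H", 0x01A3) + b"malloc\0"
print(read_import_by_name(blob, 0))
```

In a real file the RVA returned for a name entry still has to be converted to a file offset (or added to the ImageBase) before the Hint/Name pair can be read.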
Once the Windows Loader has loaded the DLLs and overwritten the Import Address Table (IAT) with all the addresses that the PE file needs to import, every call to an imported symbol (API function) is redirected to the IAT and finally to the real API address.
At run-time, if the call to the imported symbol lands on a JMP instruction, then it's going through a small stub that jumps via the IAT before reaching the API. If the call doesn't pass through any JMP, then it's probably an indirect call that takes the address from the IAT and goes directly to the API.
Malloc CALL (IDA View)
IAT Jumping To Imported Malloc (IDA View)
In a future post I will go into more detail here, doing a manual DLL hijacking by overwriting the IAT. Stay tuned. :)
Exporting Functions
Exports are another data directory, containing all the information about everything the PE file exports. We refer to these exports as "symbols"; for example, the API LoadLibraryA is an exported symbol of kernel32.dll. This directory is a little tricky because it has some confusing pointers and rules. I will try to keep it simple and objective, but for in-depth information please check the references.
When a module exports functions or data to other modules, all the information must be in the Export Directory, because this is what the Windows Loader searches when another module imports symbols from this PE file. "Symbol" is a term that covers anything that can be exported. Generally, when a module exports symbols, their names are the same as in the original source file. Let's have a look inside the export directory:
PE File - Export Directory
When we consume a DLL and need one of its functions, we generally call GetProcAddress to get the address of that function. When we do that, the Windows Loader internally searches the Export Name Table (ENT, the array pointed to by the AddressOfNames field) for the function name and gets its index in that array. It then reads the same index in the array pointed to by the AddressOfNameOrdinals field, and the value stored there is the real index used to get the RVA (Relative Virtual Address) of the exported function. The AddressOfFunctions field points to an array with all the exported functions, where each element is an RVA that points to a function. The tricky part is the Base field: the ordinal that other modules see is the index in the AddressOfFunctions array plus Base, so when a symbol is requested by ordinal the real index is Ordinal - Base, while the values stored in AddressOfNameOrdinals are already indexes and need no adjustment. Generally this field is 0x00000001 and all the symbols are in order.
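The lookup is easier to follow with the three arrays written out as plain lists. This is a toy model, not the real loader code: the DLL, its export names and its RVAs are invented, and I follow the PE specification's reading that the values in AddressOfNameOrdinals are used directly as indexes, with Base subtracted only when a symbol is requested by its public ordinal.

```python
def resolve_by_name(name, names, name_ordinals, functions):
    """Name -> RVA, the way a GetProcAddress-style lookup walks the tables."""
    index = names.index(name)              # position in AddressOfNames
    function_index = name_ordinals[index]  # same position in AddressOfNameOrdinals
    return functions[function_index]       # RVA stored in AddressOfFunctions


def resolve_by_ordinal(ordinal, base, functions):
    """Public ordinal -> RVA; Base is removed to get the array index."""
    return functions[ordinal - base]


# Made-up export directory of a hypothetical DLL.
base = 1
functions = [0x1500, 0x1530, 0x1560]  # AddressOfFunctions (RVAs)
names = ["add", "sub"]                # AddressOfNames
name_ordinals = [2, 0]                # AddressOfNameOrdinals

image_base = 0x6C300000
add_address = image_base + resolve_by_name("add", names, name_ordinals, functions)
print(hex(add_address))  # 0x6c301560
```

In this toy table the function at index 1 has no name at all, so it can only be reached with resolve_by_ordinal(2, base, functions).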
I think this is the most important part of all, so I will make it more detailed by debugging GetProcAddress. I think it's interesting to see how things work. GetProcAddress is an API exported by kernel32.dll (the Windows subsystem DLL), which relies on kernelbase.dll, which in turn uses the native API in ntdll.dll (undocumented).
If you want to try it, I will leave the code on my github so you can compile the DLL and the code that consumes it.
Remember at this point to stay objective and follow only the addresses that matter to you. I used x64dbg, a very good debugger that has a lot of functionality and a great community developing it. After the LoadLibraryA call our DLL is already loaded in memory; you can see it in the Memory Map tab (inside x64dbg). In my case it was loaded at address 0x6C300000, so this is the address that matters to us. Let's debug it. I set a breakpoint on GetProcAddress:
Breakpoint - GetProcAddress
After the breakpoint I stepped in until I found the beginning of the routine where the "Windows Loader" starts searching for the "NONAME" symbol in the Export Directory of our DLL. I don't want to make this part too long, so I will get right to the point. While debugging you can see that it received the BaseAddress of our DLL, then got the NT Header, checked that it is a valid PE (OptionalHeader->Magic value), got the Export Directory RVA and the Export Directory Size, and now ntdll.dll is reading our export directory. As you can see in the next image, EAX already holds AddressOfNames (0x6C30602C), and the code starts comparing the name provided to GetProcAddress with the names exported by the DLL.
EAX=AddressOfNames // ECX=PointerTo_NONAME_Function // EDX=NameToCompare
The comparison confirms that it is the same function and that it is at index 0, because it was the first function in AddressOfNames. In the next images we will see that it then read the value at index 0 of the AddressOfNameOrdinals array, and with this value it was able to look up the function address in the AddressOfFunctions array.
Getting the AddressOfNameOrdinals
Getting the NONAME symbol Address
Well, I hope I was clear enough in explaining the whole process and introducing you to the Windows Loader, the basics of the hierarchy between the subsystem and the native API, and how important the PE format is to the operating system. If you have any questions just let me know; you can email me or reach me on twitter.
Some PE files export their symbols only by ordinal value. An ordinal is just an index of the symbol, as I mentioned. So when a module imports a symbol by ordinal, its Import Section specifies which ordinal the Loader must search for in the Export Section of the imported module.
Resource files
The resources are another data directory in the PE file. They generally have their own section (.rsrc), but this is not a rule. As I mentioned before, the PE file has all the ingredients included in the recipe, so anything the PE file needs can be included in the file itself. Any type of file can be included in the PE file; after all, every kind of file is just binary data.
The resources are just embedded files. There are several ways to get these resources at run-time, some advanced and some basic. Generally we use the Windows API to load the resources. I intend to introduce these methods in the future; for now let's see how it works.
PE File - Resource Directory
The resource directory is a little confusing if you have to read these structures by hand, but with a PE reader it's very easy. It works as a chain of structures. The main structure is IMAGE_RESOURCE_DIRECTORY, which contains several fields. Only two of them are important here, NumberOfNamedEntries and NumberOfIdEntries: added together, these two values give the size of the array of the next structure, IMAGE_RESOURCE_DIRECTORY_ENTRY.
The IMAGE_RESOURCE_DIRECTORY_ENTRY structure has two fields, Name and OffsetToData. Now comes the tricky part. If the most significant bit of the Name field is set (equal to 1), then the remaining bits are the offset to the name of the resource; if it's not set, then the field is an ID for the resource. If the most significant bit of the OffsetToData field is set, then the remaining bits are the offset to another IMAGE_RESOURCE_DIRECTORY. If this bit is not set, then the offset points to the data entry (IMAGE_RESOURCE_DATA_ENTRY) that describes the resource itself. It's important to remember that these offsets are always relative to the beginning of the resource section.
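The two "most significant bit" checks can be sketched like this. The 8-byte layout (Name followed by OffsetToData, both 32-bit little-endian) is the standard one; decode_resource_entry, the dictionary keys and the sample values are names and numbers I chose for the example.

```python
import struct

HIGH_BIT = 0x80000000  # most significant bit of a 32-bit field


def decode_resource_entry(raw: bytes) -> dict:
    """Decode one 8-byte IMAGE_RESOURCE_DIRECTORY_ENTRY."""
    name, offset_to_data = struct.unpack("<II", raw)
    by_name = bool(name & HIGH_BIT)
    to_directory = bool(offset_to_data & HIGH_BIT)
    return {
        "identified_by": "name" if by_name else "id",
        "name_offset_or_id": (name & 0x7FFFFFFF) if by_name else (name & 0xFFFF),
        "points_to": "directory" if to_directory else "data",
        # Offset relative to the beginning of the resource section.
        "offset": offset_to_data & 0x7FFFFFFF,
    }


# Made-up entries: an ID entry leading to a sub-directory,
# and a named entry leading to the resource data.
print(decode_resource_entry(struct.pack("<II", 0x0000000A, 0x80000020)))
print(decode_resource_entry(struct.pack("<II", 0x80000100, 0x00000048)))
```

A PE viewer repeats this decoding at every level of the chain until it reaches an entry that points to data.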
In malware analysis, the resources are a good place to look for additional parts of the malware that will be dropped on the machine (droppers). Malware generally does this to try to bypass antivirus alerts, because many antivirus products, if not all of them, perform heuristic analysis searching for malicious behavior. So in many cases the malware (PE file) stored in the resources is packed; when the main program is executed, it unpacks the file from the resource section and drops it in some folder. To help us we have tools like CFF Explorer and Resource Hacker.
To make it clearer, let's compile a program with another executable inside of it. I will leave the link to the source used here. I compiled everything with GCC. Follow the link:
Well, as you can see, I used an icon and another source file. Follow the steps on github to compile the first source, inversing.c, whose executable will be inserted into our main program, blog_resource.c.
After you have compiled inversing.c, move the executable to the blog_resource.c folder. In res.rc you can find the names of the files to compile; I used "inversing.exe", but you can use whatever you want, just change the name in the .rc file. So you first have to compile res.rc to an object file; for this we use the windres tool that ships with the GCC toolchain. Read the README, which has all the commands to compile. If you have any questions, please contact me.
After you have compiled blog_resources.exe you can execute it to see that the code gets the first two bytes of the inversing.exe file, which are "MZ" (the bytes 0x4D, 0x5A in file order, or 0x5A4D when read as a little-endian 16-bit value). These two bytes are the IMAGE_DOS_SIGNATURE found at the start of every PE executable. We can see it in Resource Hacker.
Resource Hacker - inversing.exe inside blog_resources.exe
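If the byte order of the signature looks confusing, this tiny sketch shows the same two bytes seen in both ways. The blob is a made-up stand-in for the first bytes of an embedded executable, and looks_like_executable is a name I invented.

```python
import struct

# Stand-in for the first bytes of an embedded executable (not a real file).
blob = b"MZ\x90\x00"

first_two = blob[:2]
(as_little_endian,) = struct.unpack("<H", first_two)
(as_big_endian,) = struct.unpack(">H", first_two)

print([hex(b) for b in first_two])  # bytes in file order
print(hex(as_little_endian))        # value a little-endian CPU reads
print(hex(as_big_endian))


def looks_like_executable(data: bytes) -> bool:
    """Cheap check for the DOS signature at the start of a blob."""
    return data[:2] == b"MZ"
```

The same check is a quick first test when you extract an unknown resource and want to know whether it might be another PE file.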
It's confusing, but you can read more about it in the references. In future posts I want to go into more detail on how a dropper could work using the Windows API and without any API (it's possible too). Anyway, in the references you can read more about the functions I used to find the resource file.
.NET Header
This header is present in PE files compiled for the .NET Framework (obviously). It is needed for information specific to .NET compilation, such as metadata and intermediate language (IL). Unlike languages compiled directly to machine code, like C/C++, .NET has its entry point in MSCOREE.DLL, which is the DLL that starts the execution based on the information in the .NET header.
To make it a little clearer, the image below gives a short overview of the .NET execution flow:
.NET Framework - Book Eldad Eilam - Reversing: Secrets of Reverse Engineering
The assembly part in this case comes from the .NET Framework; the application itself is Intermediate Language (IL) that is executed by the Common Language Runtime (CLR). In this case the disassembler tool to use is one for IL.
Conclusion
First of all, I apologize for any mistakes I may have made; if you find any, I would appreciate it if you contacted me. If you have any questions, I will be glad to help if I can. As I already said, I tried to keep it simple and objective, and I hope this can be useful to someone. This post is one of a series of introductory topics for my future, more technical posts.
I linked all the references at the bottom. As I already said, the answer to any question that may arise is almost certainly in the references, but feel free to contact me anyway. My post is kept simple and objective to give a direction about PE files; for more in-depth information, the references are the way to go. Thanks! :)
References