Wednesday, October 18, 2017

PE - Portable Executable File

Hello, folks!

I'm here again with an overview of PE files. I think it would be unproductive to write an extensive technical post about them, since there is already a lot of information on the web. I linked some very good references at the end of the post.

So the idea here is just to give a general picture of the PE file; from there you can do your own research, using the references as a starting point. I think it's easier to learn when you grasp the general idea first and then dig deeper into each topic.

Everything I wrote here is in the references, so you can use them for more in-depth reading. Feel free to ask me any questions; I would be glad to help if I can.

Tools used in this post:

PE File Format


Let's start thinking about the PE file with this poor analogy: a PE file is a kind of recipe with all the ingredients inside it. It has all the steps to execute the recipe, with every ingredient either included or referenced.

This format is inherited from COFF (Common Object File Format), which came along with the VAX/VMS architecture, since many of the original Windows NT developers came from Digital Equipment Corporation, which used the COFF file format. These formats serve as the basis for loaders to read an executable on the system. So, to get Windows NT going quickly, the developers kept the original format and enhanced it into PE (Portable Executable).

The PE format is the standard executable format on Windows, i.e. a way of organizing the data inside a file that lets all flavors of Windows read it, load it and execute it. There is almost no difference between 32-bit and 64-bit PE files; the difference lies mostly in the size of some fields.

DLL and EXE files use the same PE format and differ only in the values of a few fields, mainly "File Header->Characteristics". OCX and even CPL files are basically DLLs. Once you know the structure of the PE file, you know how the executable is laid out in memory when it's executed, since the loader uses that structure to decide which parts of the file on disk are mapped into memory. Let's see a little overview in the picture below:

PE File - File on Disk and In Memory

All the information the loader maps into memory is in the PE file itself, and so is all the information needed to translate an offset in the file on disk to an address in the mapped image. Next is an SVG image from Wikipedia that gives a nice view of the PE:


When the PE file is loaded into memory it is known as a module, and so is every other PE file it imports. The address where a module begins is known as its HMODULE, as it's referred to in Microsoft's API. Unlike on disk, in memory we have the concept of virtual memory, i.e. we don't access the real physical memory of our computer; the OS creates a virtual address space to allocate everything and translates the virtual addresses to real physical memory. This gives the OS better control over memory management and security. Regions of the virtual address space are protected by the Windows Memory Manager (a Windows component) as read-only, read/write or executable, according to what is specified in the section headers of the PE file.


MS-DOS Header


It came in handy in the first versions of Windows, because Windows machines weren't as common as they are nowadays, so the executable could at least print a message asking for Windows to run it. This header, and executables in general, always start with the e_magic field, whose value is IMAGE_DOS_SIGNATURE; it's important to remember this. The most important field here is e_lfanew, which holds the offset to the NT Header, where all the useful information resides.
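As a minimal sketch of how a tool could follow e_magic and e_lfanew to reach the NT Header, here is a small Python example. It is my own illustration, not code from the post's repository: the function names and the synthetic buffer are hypothetical, while the offsets (e_lfanew at 0x3C) and the "MZ" and "PE\0\0" signatures come from the PE format itself.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D      # the bytes "MZ" read as a little-endian 16-bit value
IMAGE_NT_SIGNATURE = 0x00004550   # the bytes "PE\0\0" read as a little-endian 32-bit value


def find_nt_header(data: bytes) -> int:
    """Return the file offset of the NT Header, or raise ValueError."""
    if len(data) < 0x40:
        raise ValueError("too small to hold an MS-DOS header")
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != IMAGE_DOS_SIGNATURE:
        raise ValueError("missing MZ signature")
    # e_lfanew is the last field of the 64-byte MS-DOS header, at offset 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 4 > len(data):
        raise ValueError("e_lfanew points outside the file")
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    if signature != IMAGE_NT_SIGNATURE:
        raise ValueError("missing PE signature")
    return e_lfanew


def make_fake_image() -> bytes:
    """Build a tiny synthetic buffer for the demo (not a runnable program)."""
    buf = bytearray(0x100)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x80)   # pretend the NT Header is at 0x80
    buf[0x80:0x84] = b"PE\x00\x00"
    return bytes(buf)


if __name__ == "__main__":
    print(hex(find_nt_header(make_fake_image())))
```

To try it on a real file, you could pass the result of open(path, "rb").read() instead of the synthetic buffer.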


PE Sections


PE file sections are used to split up the data in the file. Some sections hold code and others hold data. There are several kinds of data, such as space to read and write information, API imports, function exports, resources and so on. Every section in the PE file specifies what is in it. Broadly, a PE file has two types of section: code and data.

The Windows Loader grabs the information in the section header to properly load the section in memory. Each section has its own attributes, such as the type of data it holds and whether it is read-only or read/write in memory, all specified in the Characteristics field of the section header. A section can also be shared between processes if that is specified.

Section names are just a way to better identify what is inside; for the operating system the name itself doesn't matter, only the Characteristics field, which indicates the type of the section.
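To make the Characteristics field more concrete, here is a small Python sketch that decodes it. The flag values are the IMAGE_SCN_* constants from winnt.h; the function name describe_section and the output format are just my own choices for illustration.

```python
# Flag values from winnt.h / the PE format specification.
IMAGE_SCN_CNT_CODE = 0x00000020
IMAGE_SCN_CNT_INITIALIZED_DATA = 0x00000040
IMAGE_SCN_CNT_UNINITIALIZED_DATA = 0x00000080
IMAGE_SCN_MEM_SHARED = 0x10000000
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def describe_section(characteristics: int) -> str:
    """Summarize a section's content type and memory protection."""
    kinds = []
    if characteristics & IMAGE_SCN_CNT_CODE:
        kinds.append("code")
    if characteristics & IMAGE_SCN_CNT_INITIALIZED_DATA:
        kinds.append("initialized data")
    if characteristics & IMAGE_SCN_CNT_UNINITIALIZED_DATA:
        kinds.append("uninitialized data")

    # Build an "rwx"-style string, with "-" for each missing permission.
    perms = "".join(
        letter if characteristics & flag else "-"
        for flag, letter in (
            (IMAGE_SCN_MEM_READ, "r"),
            (IMAGE_SCN_MEM_WRITE, "w"),
            (IMAGE_SCN_MEM_EXECUTE, "x"),
        )
    )
    if characteristics & IMAGE_SCN_MEM_SHARED:
        perms += " shared"
    return f"{'+'.join(kinds) or 'unknown'} {perms}"


if __name__ == "__main__":
    # Values commonly seen for .text, .data and .rdata sections.
    for value in (0x60000020, 0xC0000040, 0x40000040):
        print(hex(value), "->", describe_section(value))
```

Note that the section name never enters the function, which is the point made above: only the flags matter.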

It's important to remember that, since the operating system uses virtual memory protection, the alignment of the sections (the spacing between where sections start) in memory is different from the file on disk. On disk the sections are aligned to Optional Header->FileAlignment, whose default value is 200h (hex), so the offsets on disk look like 200h, 400h, 600h... In memory the loader maps the sections so that each one starts at the beginning of a memory page (which receives the read-only, read/write or execute protection specified in the section header); this is the Optional Header->SectionAlignment value. The memory page size is 4 KB on both 32-bit (x86) and 64-bit (x64) Windows, and 8 KB on Itanium. So this is usually the alignment of the sections in memory; anyway, you can always check both fields in the Optional Header.
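The rounding itself is simple arithmetic. Here is a tiny Python sketch; the helper name align_up and the sample size 0x1234 are hypothetical, and the alignment values 0x200 and 0x1000 are just typical ones, so always check the real fields in the Optional Header.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


if __name__ == "__main__":
    size = 0x1234  # hypothetical amount of data in a section
    print(hex(align_up(size, 0x200)))    # space it takes on disk
    print(hex(align_up(size, 0x1000)))   # space it takes in memory
```

The same section therefore occupies different amounts of space on disk and in memory, which is why offsets on disk and addresses in memory drift apart.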


Relative virtual address (RVA)


The RVA is an important piece of the PE file. It's used to locate objects after the file is loaded in memory. When a PE file is loaded in memory it starts at some address that we call the ImageBase (this address will be the HMODULE); it's stored in OptionalHeader->ImageBase. To simplify everything, let's see in CFF Explorer (a PE viewer) how they are expressed:

PE File - Section Header / Virtual Address

So to locate the .text section in memory we use the VirtualAddress field, which is the Relative Virtual Address of the section in memory; it's relative because the final address of the section depends on the ImageBase. To locate any section in memory we add ImageBase + RVA. If the ImageBase is 0x400000 and the .text section RVA is 0x1000, the final address is 0x401000, which is where the .text section starts in memory.

So in the header we have both the RVA (VirtualAddress) and the offset on disk (RawAddress). If the PE file is not mapped into memory we use only the Raw Address; if it is loaded we use only ImageBase + VirtualAddress.
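Here is a rough Python sketch of both translations: RVA to virtual address, and RVA to offset on disk. The Section tuple, the function names and the sample section table are all made up for the example; a real tool would read these values from the section headers.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int  # RVA where the section starts in memory
    virtual_size: int     # size of the section in memory
    raw_address: int      # offset of the section in the file on disk
    raw_size: int         # size of the section data on disk


def rva_to_va(image_base: int, rva: int) -> int:
    """Address in memory once the module is loaded at image_base."""
    return image_base + rva


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Offset in the file on disk, or None if the RVA has no bytes on disk."""
    for section in sections:
        span = max(section.virtual_size, section.raw_size)
        if section.virtual_address <= rva < section.virtual_address + span:
            delta = rva - section.virtual_address
            if delta >= section.raw_size:
                return None  # exists only in memory (zero-filled tail)
            return section.raw_address + delta
    return None


# Hypothetical section table, similar to what a PE viewer would show.
SECTIONS = [
    Section(".text", 0x1000, 0x1A00, 0x400, 0x1C00),
    Section(".data", 0x3000, 0x200, 0x2000, 0x200),
]

if __name__ == "__main__":
    print(hex(rva_to_va(0x400000, 0x1000)))
    print(hex(rva_to_offset(0x1010, SECTIONS)))
```

The first call reproduces the 0x401000 example from the text; the second shows that the same RVA lives at a different offset on disk.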


Data Directories


Data directories are data structures used to keep information that the PE file needs. For example, the import directory has a data structure that contains all the information the Windows Loader needs when loading the PE file in memory, so it can load the imports before it starts to execute the code.

Examples of data directories are imports, exports and resources. So in the PE file we have a header to locate each one of these structures. The next image shows the header of the Data Directories; here you have the RVA of each directory in memory and its size.

PE File - Data Directories
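For the curious, here is a Python sketch of how that table of (RVA, size) pairs could be read from the raw Optional Header bytes. The offsets follow the PE format (NumberOfRvaAndSizes sits at offset 92 for PE32 and 108 for PE32+, with the table right after it); the function names, the short directory names and the synthetic header are my own assumptions for the demo.

```python
import struct

# The 16 standard entries, in table order (short names chosen for this demo).
DIRECTORY_NAMES = [
    "Export", "Import", "Resource", "Exception", "Security", "BaseReloc",
    "Debug", "Architecture", "GlobalPtr", "TLS", "LoadConfig", "BoundImport",
    "IAT", "DelayImport", "ComDescriptor", "Reserved",
]


def read_data_directories(optional_header: bytes) -> dict:
    """Map directory name -> (rva, size) from raw Optional Header bytes."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    if magic == 0x10B:      # PE32
        count_offset = 92
    elif magic == 0x20B:    # PE32+
        count_offset = 108
    else:
        raise ValueError("unknown Optional Header magic")
    (count,) = struct.unpack_from("<I", optional_header, count_offset)
    table = {}
    for index in range(min(count, len(DIRECTORY_NAMES))):
        # Each entry is two 32-bit values: RVA, then size.
        rva, size = struct.unpack_from(
            "<II", optional_header, count_offset + 4 + index * 8
        )
        table[DIRECTORY_NAMES[index]] = (rva, size)
    return table


def make_fake_optional_header() -> bytes:
    """Synthetic PE32 Optional Header with only two directories filled in."""
    buf = bytearray(224)
    struct.pack_into("<H", buf, 0, 0x10B)
    struct.pack_into("<I", buf, 92, 16)
    struct.pack_into("<II", buf, 96 + 0 * 8, 0x6000, 0x50)  # Export
    struct.pack_into("<II", buf, 96 + 1 * 8, 0x5000, 0x3C)  # Import
    return bytes(buf)


if __name__ == "__main__":
    for name, (rva, size) in read_data_directories(make_fake_optional_header()).items():
        print(f"{name:14} rva={rva:#x} size={size:#x}")
```

An entry with RVA 0 and size 0 simply means the file does not have that directory.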


Importing Functions


In the PE file we have a directory containing all the information about the imported functions: which functions come from which DLLs. With it the Loader can load and locate all the symbols it needs to run the module.

When you use imported functions from DLLs, the compiler automatically generates the PE file specifying in the import section which DLLs are being used by the file, so the Windows Loader can properly load each DLL and prepare it to be used by the file at run-time. Note that in my source code I didn't import all these functions, but the compiler did. I used GCC and, as you can see, it imported a lot of functions for internal purposes like security and so on.

PE File - Import Directory


The PE file keeps an array of data structures, one for each imported DLL. Each data structure references two arrays known as the Import Address Table (IAT) and the Import Name Table (INT). In the previous image we can see that each of these data structures has the name of the DLL (ModuleName) along with two arrays, OFT (OriginalFirstThunk / INT) and FT (FirstThunk / IAT).

The tricky part here is that both arrays have the "same" structure, because the structure itself is a union that can hold any of the values defined in it. I recommend you read the references to get a more comprehensive understanding of it; take your time reading and exploring the PE file. Tricky part aside, in general the FirstThunk field points to the IAT array, which is overwritten by the Windows Loader with the addresses of all the API functions, and OriginalFirstThunk points to the INT array, where each entry leads to a structure with two fields, Hint and Name. The Name field is the name of the imported function, and the Hint is the index where that name might be found in the export table of the DLL.

PE File - Import Descriptor
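Here is a small Python sketch of how one entry of the INT could be interpreted. The high bit of the entry (IMAGE_ORDINAL_FLAG32 / IMAGE_ORDINAL_FLAG64 in winnt.h) says whether the import is by ordinal or by name; the function names and the sample bytes below are hypothetical.

```python
import struct

IMAGE_ORDINAL_FLAG32 = 0x80000000
IMAGE_ORDINAL_FLAG64 = 0x8000000000000000


def decode_thunk(value: int, is_64bit: bool = False):
    """Classify one INT entry as an import by ordinal or by name."""
    flag = IMAGE_ORDINAL_FLAG64 if is_64bit else IMAGE_ORDINAL_FLAG32
    if value & flag:
        return ("ordinal", value & 0xFFFF)
    # Otherwise the low 31 bits are the RVA of a Hint/Name structure.
    return ("name_rva", value & 0x7FFFFFFF)


def read_hint_name(data: bytes, offset: int):
    """Read a Hint/Name structure: a 16-bit hint, then a NUL-terminated name."""
    (hint,) = struct.unpack_from("<H", data, offset)
    end = data.index(b"\x00", offset + 2)
    return hint, data[offset + 2:end].decode("ascii")


if __name__ == "__main__":
    print(decode_thunk(0x80000010))   # import by ordinal
    print(decode_thunk(0x00006120))   # import by name
    blob = struct.pack("<H", 0x2A) + b"malloc\x00"   # made-up Hint/Name bytes
    print(read_hint_name(blob, 0))
```

Keep in mind that read_hint_name expects a file offset, so on a file on disk the RVA returned by decode_thunk would first have to be translated using the section headers.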

Once the Windows Loader has loaded the DLLs and overwritten the Import Address Table (IAT) with all the addresses that the PE file needs to import, all the calls to imported symbols (API functions) go through the IAT and finally to the real API address.

At run-time, if the call to the imported symbol lands on a JMP instruction, then it is going through a small stub that jumps via the IAT before reaching the API. If the call doesn't pass through any JMP, then it is probably calling the API directly with the address stored in the IAT.

Malloc CALL (IDA View)

IAT Jumping To Imported Malloc (IDA View)


In a future post I will go into more detail here, doing a manual DLL hijack by overwriting the IAT. Stay tuned. :)

Exporting Functions


Exports are another data directory, containing all the information about everything the PE file exports. We refer to these exports as "symbols"; for example, the API LoadLibraryA is an exported symbol of kernel32.dll. This directory is a little tricky because it has some confusing pointers and rules. I will try to keep it simple and objective, but for in-depth information please check the references.

When exporting functions or data to other modules, all the information must be in the Export Directory, because this is what the Windows Loader searches when another module imports symbols from this PE file. "Symbol" is a term that covers anything that can be exported. Generally, when a module exports symbols, their names are the same as originally written in the source file. Let's have a look inside the export directory:

PE File - Export Directory

When we are consuming a DLL and need one of its functions, we generally call GetProcAddress to get the address of that function. When we do that, internally the Windows Loader goes through the Export Name Table (ENT / field AddressOfNames), finds the index of the function name in that array, and then reads the same index in the array pointed to by the field AddressOfNameOrdinals. The value stored there is the real index used to get the RVA (Relative Virtual Address) of the function: the field AddressOfFunctions points to an array of all the exported functions, where each element is an RVA that points to a function. The tricky part is the field Base, which relates this index to the ordinal number of the export: ordinal = index + Base, so when a function is requested by ordinal the Loader subtracts Base to get the index into the AddressOfFunctions array. Generally Base is 0x00000001 and all the symbols are in order.
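The lookup can be sketched in a few lines of Python. This is a simplified model, not the Loader's real code: the three lists stand in for the arrays behind AddressOfNames, AddressOfNameOrdinals and AddressOfFunctions, the sample names and RVAs are invented, and I assume (following the PE format specification) that the values in the name-ordinal array are already indexes and that Base is subtracted from an ordinal number to get an index.

```python
from typing import Optional, Sequence


def resolve_by_name(
    name: str,
    names: Sequence[str],
    name_ordinals: Sequence[int],
    functions: Sequence[int],
) -> Optional[int]:
    """Return the RVA of an export found by name, or None."""
    try:
        position = names.index(name)      # index in the Export Name Table
    except ValueError:
        return None
    index = name_ordinals[position]       # same position, ordinal array
    return functions[index]               # index into the function RVAs


def resolve_by_ordinal(
    ordinal: int, base: int, functions: Sequence[int]
) -> Optional[int]:
    """Return the RVA of an export requested by ordinal number, or None."""
    index = ordinal - base
    if 0 <= index < len(functions):
        return functions[index]
    return None


# Invented export tables for a DLL with three exports and Base = 1.
NAMES = ["Alpha", "Beta", "Gamma"]
NAME_ORDINALS = [2, 0, 1]
FUNCTIONS = [0x1100, 0x1200, 0x1300]
BASE = 1

if __name__ == "__main__":
    print(hex(resolve_by_name("Alpha", NAMES, NAME_ORDINALS, FUNCTIONS)))
    print(hex(resolve_by_ordinal(1, BASE, FUNCTIONS)))
```

The name table is kept sorted, so a real implementation can use a binary search instead of the linear names.index used here.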

I think this part is the most important of all, so I will go into more detail by debugging GetProcAddress. I think it is interesting to see how things work. GetProcAddress is an API exported by kernel32.dll (the subsystem DLL), which uses kernelbase.dll, which in turn uses the native API in ntdll.dll (undocumented).

If you want to try it, I will leave the code on my github so you can compile the DLL and the code that consumes it.
Remember at this point to be objective and follow only the addresses that matter to you. I used x64dbg, a very good debugger that has a lot of functionality and a great community developing it. After LoadLibraryA our DLL is already loaded in memory; you can see it in the Memory Map tab (inside x64dbg). In my case it was loaded at the address 0x6C300000, so this is the address that matters to us. Let's debug it. I set a breakpoint on GetProcAddress:


Breakpoint - GetProcAddress

After the breakpoint I stepped in until I found the beginning of the routine where the "Windows Loader" starts to search for the "NONAME" symbol in the Export Directory of our DLL. I will not make this part too long, so I will get right to the point. While debugging you can see that it received the base address of our DLL, then got the NT Header, checked that it is a valid PE (OptionalHeader->Magic value), got the Export Directory RVA and Export Directory Size, and now ntdll.dll is inside our export directory. As you can see in the next image, EAX already holds AddressOfNames (0x6C30602C); then it starts to compare the name provided to GetProcAddress with the names exported by the DLL.

EAX=AddressOfNames // ECX=PointerTo_NONAME_Function // EDX=NameToCompare

This confirms that it is the same function, and that it is at index 0, because it was the first name in AddressOfNames. In the next image we will see that it then read the value at index 0 of the AddressOfNameOrdinals array and, with this value, was able to fetch the function address from the AddressOfFunctions array.

Getting the AddressOfNameOrdinals

Getting the NONAME symbol Address

Well, I hope I was clear enough in explaining the whole process and in introducing you to the Windows Loader, the basics of the hierarchy of subsystem and native API, and how important the PE format is to the operating system. If you have any questions just let me know; you can email me or reach me on twitter.

Some PE files export their symbols only by ordinal value. As I mentioned, the ordinal is just an index of the symbol. So when a module imports a symbol by ordinal, its import section specifies which ordinal the Loader must search for in the export section of the imported module.

Resource files


Resources are another data directory in the PE file. They generally have their own section (.rsrc), but it is not a rule. As I mentioned before, the PE file has all the ingredients included in the recipe, so anything the PE file needs can be included in it. Any type of file can be included in the PE file; after all, every kind of file is just binary data.

The resources are just embedded files. There are several ways to get these resources at run-time, some advanced and some basic. Generally we use the Windows API to load the resources. I intend to introduce these methods in the future; for now let's see how it works.

PE File - Resource Directory

The resource directory is a little confusing if you have to read the structures by hand, but with a PE reader it's very easy. It works as a chain of structures. The main structure is IMAGE_RESOURCE_DIRECTORY, which contains several fields. Only two of them are important here, NumberOfNamedEntries and NumberOfIdEntries; together these two values give the size of the array of the next structure, IMAGE_RESOURCE_DIRECTORY_ENTRY.

The IMAGE_RESOURCE_DIRECTORY_ENTRY structure has two fields, Name and OffsetToData. Now comes the tricky part. If the most significant bit of the Name field is set (1), then the remaining bits are the offset to the name of the resource; if it's not set, then the field is an ID for the resource. If the most significant bit of the OffsetToData field is set, then the remaining bits are the offset to another IMAGE_RESOURCE_DIRECTORY. If it is not set, then the offset points to the data entry (IMAGE_RESOURCE_DATA_ENTRY) that describes the resource itself. It's important to remember that these offsets are always relative to the beginning of the resource section.
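The two high-bit checks are easy to express in code. Below is a minimal Python sketch; the function name, the tuple labels and the sample values are my own, and I label the non-directory case "data_entry" on the assumption (from winnt.h) that the offset leads to an IMAGE_RESOURCE_DATA_ENTRY describing the resource.

```python
HIGH_BIT = 0x80000000


def decode_resource_entry(name: int, offset_to_data: int):
    """Interpret the two fields of an IMAGE_RESOURCE_DIRECTORY_ENTRY."""
    if name & HIGH_BIT:
        # Remaining bits: offset of the resource name string.
        identifier = ("name_offset", name & 0x7FFFFFFF)
    else:
        identifier = ("id", name & 0xFFFF)

    if offset_to_data & HIGH_BIT:
        # Remaining bits: offset of another IMAGE_RESOURCE_DIRECTORY.
        target = ("directory", offset_to_data & 0x7FFFFFFF)
    else:
        target = ("data_entry", offset_to_data)
    return identifier, target


if __name__ == "__main__":
    # ID 10 is RT_RCDATA (raw data), pointing to a subdirectory.
    print(decode_resource_entry(0x0000000A, 0x80000028))
    # A named entry that leads to the resource data.
    print(decode_resource_entry(0x80000100, 0x00000090))
```

Walking the whole tree is then a matter of following every "directory" target until only "data_entry" targets are left.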

In malware analysis this is a good place to look, because malware often keeps additional parts of itself there to drop on the machine (droppers). Generally malware does this to try to bypass antivirus alerts, because many antivirus products, if not all of them, do heuristic analysis searching for malicious behavior. So in many cases the malware (PE file) in the resources is packed; when the main program is executed it unpacks the file from the resource section and drops it in some folder. To help us we have tools like CFF Explorer and Resource Hacker.

To make it clearer, let's compile a program with another executable inside it. I will leave the link to the source used here. I compiled everything with GCC. Follow the link:

Well, as you can see I used an icon and another source file; follow the steps on github to compile the first source, inversing.c, which will be embedded in our main source, blog_resource.c.


After you have compiled inversing.c, move the executable to the blog_resource.c folder. In res.rc you can find the names of the files to compile; I used "inversing.exe", but you can use whatever you want, just change the name in the .rc file. You first have to compile res.rc to an object file; for this we use the tool windres, which ships alongside GCC. Read the README, which has all the commands to compile; any questions, please contact me.

After you have compiled blog_resources.exe you can execute it to see that the code reads the first two bytes of the embedded inversing.exe file, which are "MZ" (the bytes 0x4D, 0x5A in file order, or 0x5A4D when read as a little-endian 16-bit value). These two bytes are the IMAGE_DOS_SIGNATURE of all executable files. We can see it in Resource Hacker.

Resource Hacker - inversing.exe inside blog_resources.exe

It's confusing, but you can read more about it in the references. In future posts I want to show in more detail how a dropper could work using the Windows API and without any API (it's possible too). Anyway, in the references you can read more about the functions I used to find the resource file.

.NET Header


This header is present in PE files compiled for the .NET Framework (obviously). It is needed for information specific to .NET, such as metadata and intermediate language (IL). Unlike languages compiled directly to machine code, like C/C++, a .NET executable has its entry point in MSCOREE.DLL, which is the DLL that starts the execution based on the information from the .NET header.

To make it a little clearer, in the image below you can see an overview of the .NET execution flow:

.NET Framework - Book Eldad Eilam - Reversing: Secrets of Reverse Engineering


The native assembly in this case belongs to the .NET Framework; the application itself is Intermediate Language (IL) that is executed by the Common Language Runtime (CLR), which compiles it just in time. So the disassembler to use in this case is one for IL.

Conclusion


First of all, I apologize for any mistakes I may have made, and if you find any I would appreciate it if you contacted me. If you have any questions, I would be glad to help if I can. Well, as I already said, I tried to keep it simple and objective. I hope this can be useful to someone. This post is one of a series of introductory topics for my future, more technical posts.

I linked all the references at the bottom. As I already said, the answer to any question that may arise is certainly in the references, but feel free to contact me anyway. My post is simple and objective, meant to give a direction about PE files; for more in-depth information, the references are the way to go. Thanks! :)

References
