Inversing: Basics of Assembly

Introduction to Assembly

Hello everyone, it's me again with some more basic material. Even though it is basic, I hope it can help someone. I am writing these introductory posts to prepare for the upcoming ones, where I will bring more technical analysis. Sorry for any mistakes I may have made in this post; if you find one, please send me feedback.
Well, assembly code is the "only" way into reverse engineering: either you interpret assembly mnemonics or you analyze opcodes (machine code). As said in the previous introductory topic (if you didn't read it, I recommend you do), every compiled language ends up as machine code generated by the compiler, and assembly is its readable form, so understanding it is vital to RE. Assembly language is not standardized: the mnemonics (instructions) differ depending on your processor architecture. Each processor is made of several circuits and each processor family has its own circuits, so the logic behind its processing differs, and so do the instructions it "understands". In this blog I will take the IA-32 architecture as the approach. I will not "teach"; this is more an overview of what it is and how we work with it. It starts with knowing how the computer "works", as Memory-Data X Instruction-Operation: logically, the computer stores all data in memory, and that data is interpreted and executed by the CPU.

Computer / Memory, CPU, IO Devices and BUS

The processor has pointers that help the CPU keep track of what it needs to do and with which data: the instruction pointer and the data pointer. We will see them in practice throughout the posts. The instruction pointer points to the memory block that holds an instruction, and the data pointer points to the memory block that holds the data; the CPU then executes the instruction on the data being pointed to.

CPU Units X Memory

These pointers are memory offsets held in registers. As the processor executes instructions, the instruction pointer moves on to the next instruction and the data pointer moves on to the next data. The part of an instruction that says what to do is called the opcode (operation code) and takes 1 to 3 bytes; operand bytes may follow it, so a complete IA-32 instruction can be up to 15 bytes long. Basically, assembly has three "parts": opcode mnemonics, data sections and directives.
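
To make the instruction pointer and data pointer idea concrete, here is a minimal sketch in C of a fetch-and-execute loop. It is only an illustration: the ToyCpu structure and the three opcodes (OP_HALT, OP_LOAD, OP_ADD) are invented for this example and are not real IA-32 encodings.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical one-byte opcodes for a toy machine (NOT real IA-32 encodings). */
enum { OP_HALT = 0x00, OP_LOAD = 0x01, OP_ADD = 0x02 };

typedef struct {
    size_t  ip;   /* instruction pointer: index of the next code byte */
    size_t  dp;   /* data pointer: index of the next data value       */
    int32_t acc;  /* accumulator, standing in for a register          */
} ToyCpu;

/* Runs code[] against data[] until OP_HALT or the end of the code,
   then returns the accumulator. */
int32_t toy_run(const uint8_t *code, size_t code_len,
                const int32_t *data, size_t data_len)
{
    ToyCpu cpu = { 0, 0, 0 };

    while (cpu.ip < code_len) {
        uint8_t opcode = code[cpu.ip++];      /* fetch, then advance IP */

        if (opcode == OP_HALT) {
            break;
        }
        if (cpu.dp >= data_len) {             /* no operand left to read */
            break;
        }
        if (opcode == OP_LOAD) {
            cpu.acc = data[cpu.dp++];         /* acc = next data value */
        } else if (opcode == OP_ADD) {
            cpu.acc += data[cpu.dp++];        /* acc += next data value */
        } else {
            break;                            /* unknown opcode: stop */
        }
    }
    return cpu.acc;
}
```

Running the code bytes OP_LOAD, OP_ADD, OP_ADD, OP_HALT against the data 5, 7, 30 walks both pointers forward one step per instruction and leaves 42 in the accumulator.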

Opcode

A mnemonic is the "English" representation of the instruction code, e.g. the opcode byte 89 (hexadecimal) is one encoding of the 'mov' mnemonic. Different assembly syntaxes represent the same instructions differently.
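
As a rough sketch of what a disassembler does with the first byte of an instruction, the C function below maps a handful of well-known one-byte IA-32 opcodes to their mnemonics. The function name mnemonic_for is made up for this example, and the table is far from complete: real decoding also has to look at prefixes, the ModR/M byte and multi-byte opcodes.

```c
#include <stdint.h>

/* Maps a few well-known one-byte IA-32 opcodes to their mnemonics.
   Returns "?" for anything this sketch does not cover. */
const char *mnemonic_for(uint8_t opcode)
{
    if (opcode >= 0x50 && opcode <= 0x57) return "push"; /* PUSH r32 (0x50 + register number) */
    if (opcode >= 0x58 && opcode <= 0x5F) return "pop";  /* POP r32  (0x58 + register number) */

    switch (opcode) {
    case 0x89: return "mov";  /* MOV r/m32, r32 */
    case 0x8B: return "mov";  /* MOV r32, r/m32 */
    case 0x90: return "nop";
    case 0xC3: return "ret";  /* near return */
    case 0xE8: return "call"; /* CALL rel32 */
    default:   return "?";
    }
}
```

For example, the bytes 55 8B EC that open many functions start with 55 (push, here PUSH EBP) followed by 8B (mov, here MOV EBP,ESP once the EC byte is decoded).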

OllyDbg / Right Assembly, Left OpCodes

Data

The data section is the space used to store the data that the instructions use when they execute. This data can live in a memory section or on the stack (a memory area, more on it later). Tools show the data in hexadecimal, and it is referenced by its memory address. Every piece of data stored in memory has an address; a value can also appear as an immediate constant inside the instruction itself, or it can be stored in the stack frame.

Directives

Directives are the elements used in assembly to tell the assembler (the program that translates assembly into machine code) how to interpret each piece of the source, and that includes everything: code and values. For example, if you want to store a float value, the assembler needs to know that so it can reserve memory properly. One of the important directives is the ".section" directive, which tells the assembler which section the following code or data belongs to, so each type of data gets its own section in memory.

Sections

We can define any kind of section we want, but most programs have these by default:

.text:

All the code instructions are placed in this section. Data generally does not live here, except for constants embedded in the instructions as immediate values, like the 5 in a = 5; whether that happens depends on the programmer in low-level programming and on the compiler in high-level programming.

.data:

This section stores the initialized data that the code in .text uses. The code in .text refers to it by address, shown as something like program.ADDRESS.

.bss:

This section is generally used for uninitialized data. I think the name may change between assemblers and compilers, but I am not sure.
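
To connect the three sections to something familiar, here is a small C sketch with comments saying where each object typically ends up. The names are invented for the example, and the placement is an assumption about common compilers and linkers; the exact section names and choices depend on the toolchain and the executable format.

```c
#include <stdint.h>

int32_t initialized_counter = 5;   /* initialized global: typically placed in .data */
int32_t zeroed_buffer[16];         /* uninitialized global: typically placed in .bss, starts as zeros */
const char greeting[] = "hello";   /* constant data: often .rodata (ELF) or .rdata (PE) */

/* The machine code of this function is typically placed in .text. */
int32_t add_to_counter(int32_t amount)
{
    int32_t local = amount * 2;    /* local variable: lives on the stack or in a register */
    initialized_counter += local;
    return initialized_counter;
}
```

Compiling a file like this and opening it in a disassembler is a quick way to check where your own compiler actually puts each object.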

The IA-32 Architecture

The IA-32 architecture is Intel's 32-bit architecture, introduced with the 80386 and used by the Pentium family of processors. I don't know for sure if it is the most used nowadays, but it is very well known and has a lot of documentation. Once you learn the basics of assembly on one architecture, it is easier to learn another, because the foundation stays the same. Some assembly languages let you do more and others less; I think that is the basic difference. In my last topic I tried to write about the computer in general terms, and I think you already know how it works, so I don't want to make this part too long. Basically, IA-32 is divided into 4 parts:

Control unit:
- The control unit is responsible for fetching all the information from memory, both data and instructions. It then decodes the instructions into micro-operations and passes them to the execution unit. The result of the operation is passed back to the control unit, which stores it.
Execution unit:
- Responsible for executing all the micro-operations.
Registers:
- Registers keep track of the data that is being used. They are memory internal to the processor. Having this small memory inside the processor makes it much faster than going outside the processor to search for data in RAM and retrieve it.
- I want to keep this objective, so I will not write about all of them. You can read more in the book listed in the references, which I recommend. There are basically four types of registers (there are more, as I said):
  - General purpose: eight 32-bit registers, used to work with data.
  - Segment: six 16-bit registers, used for memory access.
  - Instruction pointer: one 32-bit register, which points to the next instruction to be executed.
  - Floating-point: eight 80-bit registers, used to work with floating-point numbers.
Flags:
- Flags keep track of the operations executed by the processor. They are a way to know how an operation turned out. There are specific flags for specific operations, as we will see.

Registers

General Purpose

These registers are mainly used to work with data as the code is being executed. Most of the data used by the instructions passes through these registers. They are 32 bits wide at the "top" level, but they can be sliced into smaller parts (16 bits, 8 bits) that hold smaller data.

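EAX (32-bit):
- AX (16-bit):
  - AH (8-bit, high byte of AX)
  - AL (8-bit, low byte of AX)
EBX (32-bit):
- BX (16-bit):
  - BH (8-bit, high byte of BX)
  - BL (8-bit, low byte of BX)
ECX (32-bit):
- CX (16-bit):
  - CH (8-bit, high byte of CX)
  - CL (8-bit, low byte of CX)
EDX (32-bit):
- DX (16-bit):
  - DH (8-bit, high byte of DX)
  - DL (8-bit, low byte of DX)
EDI (32-bit):
- DI (16-bit)
ESI (32-bit):
- SI (16-bit)
EBP (32-bit):
- BP (16-bit)
ESP (32-bit):
- SP (16-bit)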

OllyDbg / Registers and Flags

These are the eight 32-bit registers that we will work with a lot. It is important to note that the smaller registers are parts of the larger ones, so modifying a top-level register also modifies its low-level parts, and the reverse. For example, if you put a value in AL and then put a new value in EAX, the AL value is overwritten. Some of these registers have conventional uses. EBP and ESP, for example, are used to control the stack frame, a block of memory that holds the local variables of a function/method. The stack is also used for calls: when you need to call a method/function, you push the parameters onto the stack so that the called code can grab them. As we analyze artifacts we will see the patterns of their usage, and that is why it is important to know programming. A good way to learn is to write code and analyze it. In later topics I will bring many more practical examples in the C language. It's amazing what we can do (I mean, you are managing energy, bro!! lol).
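
The way AX, AH and AL overlap EAX can be sketched in C with masks and shifts. This is only a model: a plain uint32_t stands in for EAX, and the helper names (get_ax, get_ah, get_al, set_al, set_ah) are invented for the example.

```c
#include <stdint.h>

/* Read the sub-registers out of a 32-bit value standing in for EAX. */
uint16_t get_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }
uint8_t  get_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }
uint8_t  get_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }

/* Writing AL replaces only the lowest byte of EAX. */
uint32_t set_al(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFFFF00u) | value;
}

/* Writing AH replaces only bits 8..15 of EAX. */
uint32_t set_ah(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFF00FFu) | ((uint32_t)value << 8);
}
```

With EAX holding 0x11223344, AX reads as 0x3344, AH as 0x33 and AL as 0x44; writing 0xAA to AL gives 0x112233AA, and loading a whole new value into EAX replaces all of them at once.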

Segment

The segment registers are used to identify where data is located. Each segment register holds a selector that identifies the memory segment the data is supposed to be read from. The segment registers are:

CS: Code Segment
DS: Data Segment
SS: Stack Segment
ES: Extra Segment
FS: Extra Segment
GS: Extra Segment

Each one is used in specific cases. For example, if EAX holds a memory address in the data section, EBX holds some data, and you want to save that data in the .data section, you might see this:

mov ds:[EAX], EBX

Here you are copying the value in EBX to the memory location that EAX points to in the DS segment. The [] brackets indicate that the operand is a memory location whose address is the value inside the brackets.

OllyDbg / Code accessing SS segment

We will see much more in the practical examples, which are the best way to learn.

Flags

The flags are kept in a single 32-bit register called EFLAGS, and each flag is represented by one bit. As I said, the flags record the outcome of the operations that the processor executes. For example, a conditional jump checks whether the Zero Flag was set (depending on the jump, other flags can be used) to know whether to jump or not. You already saw the flags in the image in the registers topic. The flags are divided into three groups:

Status Flags
Control Flags
System Flags

I will only discuss the status flags. If you want to know more, the books listed in the references are worth reading. The status flags signal the result of the arithmetic operations executed by the processor. They are:

CF: Carry Flag.
- Tracks the carry or borrow in arithmetic operations. It is set when an operation generates a carry out of, or a borrow into, the most significant bit, i.e. the result does not fit in the destination as an unsigned number. Used in unsigned arithmetic.
PF: Parity Flag.
- Is set when the least significant byte of the result contains an even number of 1 bits.
AF: Adjust Flag.
- Used in Binary Coded Decimal (BCD) arithmetic. It is set when an operation generates a carry out of, or a borrow into, bit 3 of the result. BCD is a nice feature for working with decimals; if you want to know more, the references have what you need.
ZF: Zero Flag.
- Is set when the result of an operation is 0.
SF: Sign Flag.
- Is a copy of the most significant bit of the result, so it is set when the operation results in a negative number.
OF: Overflow Flag.
- Is used in signed integer operations and is set when the result is too large a positive number or too small a negative number to be represented.
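
The definitions above can be sketched in C for one concrete case, a 32-bit addition. This is a model written for this post rather than processor code: the StatusFlags structure and the flags_after_add function are invented names, and only ADD is covered (subtraction and logical instructions set some flags differently).

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool cf, pf, af, zf, sf, of;
} StatusFlags;

/* Computes *result = a + b as a 32-bit ADD would and derives the status flags. */
StatusFlags flags_after_add(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t r = a + b;                 /* unsigned addition wraps modulo 2^32 */
    StatusFlags f;

    f.cf = r < a;                       /* carry out of bit 31 (unsigned overflow) */
    f.zf = (r == 0);
    f.sf = (r >> 31) & 1u;              /* copy of the most significant bit */
    f.af = ((a ^ b ^ r) >> 4) & 1u;     /* carry out of bit 3 into bit 4 */
    /* signed overflow: both operands share a sign and the result's sign differs */
    f.of = (((a ^ r) & (b ^ r)) >> 31) & 1u;

    /* PF looks only at the low byte: set when it has an even number of 1 bits */
    uint8_t low = (uint8_t)r;
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = !(low & 1u);

    *result = r;
    return f;
}
```

Adding 1 to 0xFFFFFFFF gives 0 with CF and ZF set but OF clear (as signed numbers, -1 + 1 = 0 is fine), while adding 1 to 0x7FFFFFFF sets OF and SF but not CF. That difference is why CF is the flag for unsigned arithmetic and OF the flag for signed arithmetic.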

Stack

Now we are going to talk about the stack. The stack is very important in reversing because of how commonly it is used in the assembly world. It is a memory area that the running program uses to store short-term data. The stack is generally used to:

Save register values
- You can save a register on the stack, use the register for another operation, and then restore the old value.
Store local variables
- Local variables at function scope are stored here. As I said before, such a variable is accessed directly through the SS segment at its offset in the stack.
Pass function parameters
- To call a function, you generally push all the parameters onto the stack from right to left and then call the function (this is what the common cdecl and stdcall conventions do). For example, for f(p1,p2,p3) you push p3, then p2, then p1.
Store the return address of the function call
- This is the address of the instruction right after the call instruction; with it, the program knows where to go back to when it executes the retn instruction inside the called function.
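
The parameter and return-address uses listed above can be sketched with a tiny simulated stack in C. Everything here is invented for illustration (SimStack, sim_push, sim_pop, sim_call3, sim_read), esp is an index into an array instead of a real address, and there are no overflow checks.

```c
#include <stdint.h>

#define STACK_WORDS 64

/* A tiny simulated stack. esp is an index in 4-byte words, not a real address;
   it starts one past the end and moves down on each push, like ESP. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t esp;
} SimStack;

void sim_init(SimStack *s) { s->esp = STACK_WORDS; }

/* PUSH: move ESP down first, then store the value at the new top. */
void sim_push(SimStack *s, uint32_t value) { s->mem[--s->esp] = value; }

/* POP: read the value at the top, then move ESP back up. */
uint32_t sim_pop(SimStack *s) { return s->mem[s->esp++]; }

/* Simulates the caller side of f(p1, p2, p3): arguments are pushed
   right to left, then CALL pushes the return address. */
void sim_call3(SimStack *s, uint32_t p1, uint32_t p2, uint32_t p3,
               uint32_t return_address)
{
    sim_push(s, p3);
    sim_push(s, p2);
    sim_push(s, p1);
    sim_push(s, return_address);
}

/* Reads the word at [ESP + byte_offset]; byte_offset 4 is the first argument. */
uint32_t sim_read(const SimStack *s, uint32_t byte_offset)
{
    return s->mem[s->esp + byte_offset / 4];
}
```

After sim_call3(&s, 10, 20, 30, 0x401000), the return address sits at [ESP], p1 at [ESP+4], p2 at [ESP+8] and p3 at [ESP+12], which is why pushing right to left leaves the first parameter closest to the top.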

Storing locally means that when you enter a function scope, the local variables generally use the stack to hold their data. In general everything is in memory, in the address space that the operating system gave us; the stack is just an area in that address space that the program uses to store data. To put a value on the stack you use the push instruction, and to retrieve it you use the pop instruction. Every time you enter a function, a stack frame is set up. A stack frame is the set of addresses reserved for the function in execution, and that set is bounded by EBP (base pointer) and ESP (stack pointer, the 'top'). Generally, this is how you see the stack frame being set up:

PUSH EBP
MOV EBP,ESP
SUB ESP, SIZE

First you save the old EBP, so that when you go back to the previous function you can restore its stack frame. Then you set the new base pointer and subtract from ESP the size (in bytes) that the function needs for its local data. Stacks are managed as LIFO, last in first out, so the last element you pushed onto the stack is the first one that will be popped. But you can also access the data directly if you need to. Remember the SS segment? You can use it together with the ESP register (which points to the top of the stack), like:

mov EAX, DWORD PTR SS:[ESP+4]

In this last instruction you copied the 32-bit value stored at the address "ESP (plus) 4" into the general purpose register EAX.
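
The three-instruction frame setup can be modelled in C to watch what happens to EBP and ESP. This is a sketch under stated assumptions: Machine, push32, pop32, enter_frame and leave_frame are invented names, esp and ebp are byte offsets into an array instead of real addresses, and leave_frame models the usual matching epilogue (MOV ESP,EBP followed by POP EBP), which this post does not show.

```c
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 256u

/* Simulated machine state. esp and ebp are byte offsets into mem, not real addresses. */
typedef struct {
    uint8_t  mem[MEM_BYTES];
    uint32_t esp;
    uint32_t ebp;
} Machine;

static void push32(Machine *m, uint32_t value)
{
    m->esp -= 4;                          /* the stack grows toward lower addresses */
    memcpy(&m->mem[m->esp], &value, 4);
}

static uint32_t pop32(Machine *m)
{
    uint32_t value;
    memcpy(&value, &m->mem[m->esp], 4);
    m->esp += 4;
    return value;
}

/* PUSH EBP / MOV EBP,ESP / SUB ESP,size */
void enter_frame(Machine *m, uint32_t size)
{
    push32(m, m->ebp);   /* save the caller's base pointer */
    m->ebp = m->esp;     /* the new frame starts here */
    m->esp -= size;      /* reserve size bytes for locals */
}

/* MOV ESP,EBP / POP EBP */
void leave_frame(Machine *m)
{
    m->esp = m->ebp;     /* discard the locals */
    m->ebp = pop32(m);   /* restore the caller's base pointer */
}
```

Starting with ESP at 256 and calling enter_frame with a size of 16 leaves EBP at 252 and ESP at 236; leave_frame then puts both registers back exactly where they were.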

An important thing about the stack is that it grows toward lower addresses on the IA-32 architecture. So the bigger the stack gets, the lower the addresses it uses, as we can see:

How stack works basically

In the image we can also see the heap, which is our next topic. In this post I am taking a theoretical approach, but in the next ones we will discuss it all in a practical way. For now, let's just stay abstract.

Heap

The heap is a memory area that the program uses for dynamic allocation. It is generally managed by the OS together with the program's runtime library. When the programmer needs space to store data that is bigger than the stack can handle, or whose size is only known at run time, the program stores it in heap memory. The heap is made available to the program when the OS loads it into memory. When you have a literal expression in your code like:

char newText[] = "Testing how code storing data.";

Even if this declaration is inside a function, the compiler generally places the string bytes in the .data section and gives the instruction that uses them the constant immediate address of the data, something like program.ADDRESS. But when you are loading something external, for example, you don't know the size you are going to need; the size is dynamic, so the program will have a routine that allocates the size you need on the heap.
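
The contrast between fixed data and dynamically sized data can be sketched in C with the standard malloc and free functions. The helper names copy_to_heap and copy_fixed_text are invented for the example; routines like these are the kind of allocation code you end up spotting when reversing.

```c
#include <stdlib.h>
#include <string.h>

/* Fixed-size data known at compile time: typically stored in a data section. */
static const char fixed_text[] = "Testing how code storing data.";

/* Copies text into a heap buffer whose size is only known at run time.
   Returns NULL if the allocation fails; the caller must free() the result. */
char *copy_to_heap(const char *text)
{
    size_t size = strlen(text) + 1;      /* +1 for the terminating '\0' */
    char *buffer = malloc(size);         /* request size bytes from the heap */

    if (buffer == NULL) {
        return NULL;
    }
    memcpy(buffer, text, size);
    return buffer;
}

/* Convenience wrapper for the example: heap copy of the fixed text above. */
char *copy_fixed_text(void)
{
    return copy_to_heap(fixed_text);
}
```

In a disassembly, the fixed text shows up as a constant address in the instruction, while the heap copy shows up as a call to the allocator whose returned pointer is then used for the data.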

The heap is important in RE because many programs use it to allocate data, and sometimes identifying which routine allocates the memory can be useful.

Now I think you are able to understand the upcoming posts that I will bring here. As I post new topics, I will try to go a little deeper into the technical side. In the practical part we will get a better view.

References

Google
Eldad Eilam - Reversing: Secrets of Reverse Engineering
Reverse Engineering Code with IDA Pro
Richard Blum - Professional Assembly Language

Friday, June 23, 2017