Introduction to Assembly
Hello everyone, it's me again with some more basics. Even though this is basic material, I hope it helps someone. I am writing these introductory posts to prepare for the upcoming ones, where I will bring more technical analysis. Sorry for any mistakes in this post; if you find one, please send me feedback.
Assembly is pretty much the "only" way into reverse engineering: either you read assembly mnemonics or you analyze opcodes (machine code) directly. As said in the previous introductory topic (if you haven't read it, I recommend it), whatever compiled language you use, the compiler turns your source into machine code, and assembly is the readable form of that machine code. So understanding it is vital to RE.
Assembly language is not standard across processors: the mnemonics (instructions) differ depending on the processor architecture. Each architecture has its own circuits, so the logic behind its processing differs, and the instructions each one "understands" differ too. In this blog I will work with the IA-32 architecture. I will not "teach" it; this is more an overview of what it is and how we work with it.
So let's start from how the computer "works": memory holds the data and the instructions, and the CPU performs the operations. Logically, the computer stores everything in memory, and the CPU interprets and executes it.
Computer / Memory, CPU, IO Devices and BUS
The processor has some pointers that help the CPU keep track of what it needs to do and with which data: the instruction pointer and the data pointer. Throughout the posts we will see them in practice. The instruction pointer points to the memory block that holds an instruction and the data pointer points to the memory block that holds the data; the CPU then executes that instruction on the data being pointed to.
CPU Units X Memory
These pointers are kept in registers and hold memory offsets. As the processor executes instructions, the instruction pointer moves on to the next instruction, and the data pointer moves on as well. Each instruction is encoded starting with an opcode (operation code), which on IA-32 is 1 to 3 bytes long; prefixes and operands can make the whole instruction longer. So basically an assembly program has three "parts": opcode mnemonics, data sections and directives.
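To make the idea of the two pointers a bit more concrete, here is a minimal sketch in C of a toy machine. Everything in it is an assumption made for illustration: the opcodes OP_HALT, OP_LOAD and OP_ADD are invented (they are not real IA-32 encodings), and run_toy_program is a hypothetical name. It only shows the instruction pointer and the data pointer advancing as instructions are executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented one-byte opcodes for a toy machine (NOT real IA-32 encodings). */
enum { OP_HALT = 0x00, OP_LOAD = 0x01, OP_ADD = 0x02 };

/* Runs a toy program. `code` holds the instructions and `data` the operands.
 * ip plays the role of the instruction pointer, dp of the data pointer. */
int32_t run_toy_program(const uint8_t *code, size_t code_len,
                        const int32_t *data, size_t data_len)
{
    assert(code != NULL && data != NULL);

    size_t ip = 0;    /* offset of the next instruction */
    size_t dp = 0;    /* offset of the next data element */
    int32_t acc = 0;  /* a single working register */

    while (ip < code_len) {
        uint8_t opcode = code[ip++];   /* fetch, then advance ip */
        if (opcode == OP_HALT || dp >= data_len)
            break;                     /* stop on HALT or when data runs out */
        if (opcode == OP_LOAD)
            acc = data[dp++];          /* acc = next data element */
        else if (opcode == OP_ADD)
            acc += data[dp++];         /* acc += next data element */
        else
            break;                     /* unknown opcode: stop */
    }
    return acc;
}
```

Feeding it the code bytes 01 02 02 00 with the data 5, 10, 20 loads 5 and then adds 10 and 20, so the function returns 35.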
Opcode
OllyDbg / Right Assembly, Left OpCodes
Data
Directives
Sections
- .text:
- All the code instructions are placed in this section. No data is kept here, except some fixed values such as the 5 in a = 5, which can be encoded inside the instruction itself; that depends on the programmer in low-level programming and on the compiler in high-level programming.
- .data:
- This section stores the initialized data that the code in .text uses. It is referenced by address from the .text section, like program.ADDRESS.
- .bss:
- This section is generally used for uninitialized data. I think the name might change from language to language, but I am not sure.
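As a rough illustration of the three sections, here is a small sketch in C. The names are hypothetical, and where each object really ends up depends on the compiler and linker, so take the comments as the typical case and not as a guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER_LEN 16

int initialized_value = 5;      /* has an initializer: typically goes to .data */
int zeroed_buffer[BUFFER_LEN];  /* no initializer: typically goes to .bss and
                                   is zero-filled when the program is loaded */

/* The machine code of a function typically lives in .text. The constant 2
 * usually ends up encoded inside an instruction as an immediate value. */
int double_initialized_value(void)
{
    return initialized_value * 2;
}

/* Sums the first `count` elements of the uninitialized buffer. */
int sum_zeroed_buffer(size_t count)
{
    assert(count <= BUFFER_LEN);
    int sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += zeroed_buffer[i];
    return sum;
}
```

If you compile something like this and open it in a disassembler, the access to initialized_value normally shows up as a fixed address inside the data section, which is the program.ADDRESS pattern mentioned above.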
The IA-32 Architecture
The IA-32 architecture is Intel's 32-bit x86 architecture, introduced with the 80386 and used by the Pentium processor family. I don't know for sure whether it is the most used nowadays, but it is very well known and has a lot of documentation. When you learn the basics of assembly on one architecture it becomes easier to learn another, because the foundation stays the same. Some assembly languages let you do more, others less; I think that is the basic difference. In my last topic I tried to write about computers in general, so I think you already know how they work, and I don't want to make this part too extensive. So basically IA-32 is divided into 4 parts:
- Control unit:
- The control unit is responsible for fetching everything from memory, both data and instructions. It then decodes the instructions into micro-operations and passes them to the execution unit. The result of the operation is passed back to the control unit, which stores it.
- Execution unit:
- Responsible for executing all the micro-operations.
- Registers:
- Registers keep track of the data that is being used. They are memory internal to the processor. Having this small memory inside the processor makes access much faster than going outside the processor to search for data in RAM and retrieve it.
- I want to keep this objective, so I will not write about all of them. You can read more in the book listed in the references, which I recommend. Basically there are four types of registers (there are more, as I said):
- General purpose: eight 32-bit registers. They are used to work with data.
- Segment: six 16-bit registers. Used for memory access.
- Instruction pointer: one 32-bit register. Points to the next instruction to be executed.
- Floating-point: eight 80-bit registers. Used to work with floating-point numbers.
- Flags:
- Flags keep track of the outcome of the operations executed by the processor. They are a way to know whether an operation worked or not. There are specific flags for specific operations. We will see them.
Registers
General Purposes
- EAX (32-bits):
- AX (16-bits):
- AH (8-bits)
- AL (8-bits)
- EBX (32-bits):
- BX (16-bits):
- BH (8-bits)
- BL (8-bits)
- ECX (32-bits):
- CX (16-bits):
- CH (8-bits)
- CL (8-bits)
- EDX (32-bits):
- DX (16-bits):
- DH (8-bits)
- DL (8-bits)
- EDI (32-bits):
- DI (16-bits)
- ESI (32-bits):
- SI (16-bits)
- EBP (32-bits):
- BP (16-bits)
- ESP (32-bits):
- SP (16-bits)
OllyDbg / Registers and Flags
These are the eight 32-bit registers that we will work with a lot. It is important to say that the smaller registers are part of the bigger ones, so modifying a top-level register also modifies its low-level registers. For example, if you put a value in AL and then put a new value in EAX, the AL value will be overwritten. Some of these registers have a conventional use. EBP and ESP, for example, are used to control the stack frame. A stack frame is a block of memory that holds the local variables of a function/method. You also use the stack when you need to call a method/function: you push the parameters onto the stack so the called code can grab them. As we analyze artifacts we will see the patterns of its usage, which is why it is important to know programming. A good way to learn is to write code and analyze it. In later topics I will bring many more practical examples in C. It's amazing what we can do (I mean, you are managing energy, bro!! lol).
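The relation between EAX, AX, AH and AL can be sketched in C with plain masks and shifts. This is only a model of the register layout, and the helper names (get_al, set_ah and so on) are my own assumptions for illustration, not something taken from a debugger or a library.

```c
#include <assert.h>
#include <stdint.h>

/* AX is the low 16 bits of EAX, AH is bits 8..15 and AL is bits 0..7. */
uint16_t get_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }
uint8_t  get_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }
uint8_t  get_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }

/* Writing a sub-register changes only its own bits inside EAX. */
uint32_t set_al(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFFFF00u) | value;
}

uint32_t set_ah(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFF00FFu) | ((uint32_t)value << 8);
}

uint32_t set_ax(uint32_t eax, uint16_t value)
{
    return (eax & 0xFFFF0000u) | value;
}

/* Walks through the example from the text: a value written to AL is lost
 * once the whole of EAX is overwritten. Returns the final AL. */
uint8_t al_after_full_write(void)
{
    uint32_t eax = 0;
    eax = set_al(eax, 0x7F);     /* like: mov al, 7Fh */
    assert(get_al(eax) == 0x7F);
    eax = 0x12345678u;           /* like: mov eax, 12345678h */
    return get_al(eax);          /* AL now holds 78h, the low byte of EAX */
}
```

With EAX holding 0x11223344, AX reads as 0x3344, AH as 0x33 and AL as 0x44, and writing 0xAA to AL leaves EAX as 0x112233AA.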
Segment
- CS: Code Segment.
- DS: Data Segment
- SS: Stack Segment
- ES: Extra Segment
- FS: Extra Segment
- GS: Extra Segment
mov ds:[EAX], EBX
Here you are moving the value in EBX to the memory location that EAX points to, inside the segment selected by DS. The [] brackets indicate that the operand is a pointer to a memory address.
OllyDbg / Code accessing SS segment
We will see much more in practical examples. That is the best way to learn.
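In C terms, a store through a register used as a pointer looks roughly like the sketch below. The array named memory is a made-up stand-in for the data segment, and store32/load32 are hypothetical helpers; the bytes are written in little-endian order because that is how IA-32 lays out a 32-bit value in memory.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 64u

/* Toy memory standing in for the data segment. */
static uint8_t memory[MEM_SIZE];

/* Models "mov [eax], ebx": store the 32-bit value of ebx at offset eax,
 * least significant byte first. */
void store32(uint32_t eax, uint32_t ebx)
{
    assert(eax <= MEM_SIZE - 4u);  /* keep the toy access inside the array */
    memory[eax]     = (uint8_t)(ebx & 0xFFu);
    memory[eax + 1] = (uint8_t)((ebx >> 8) & 0xFFu);
    memory[eax + 2] = (uint8_t)((ebx >> 16) & 0xFFu);
    memory[eax + 3] = (uint8_t)((ebx >> 24) & 0xFFu);
}

/* Models "mov ebx, [eax]": read the 32-bit value back from offset eax. */
uint32_t load32(uint32_t eax)
{
    assert(eax <= MEM_SIZE - 4u);
    return (uint32_t)memory[eax]
         | ((uint32_t)memory[eax + 1] << 8)
         | ((uint32_t)memory[eax + 2] << 16)
         | ((uint32_t)memory[eax + 3] << 24);
}
```

Storing 0xDEADBEEF at offset 8 leaves the byte 0xEF at offset 8 and 0xDE at offset 11, which is the byte order you will see in a debugger's memory dump.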
Flags
The flags are kept in a single 32-bit register called EFLAGS, and each flag is represented by one bit. As I said, the flags tell you whether the operations the processor executed worked or not. For example, a conditional jump checks whether the Zero Flag (or another flag, depending on the jump) is set in order to decide whether to jump or not. You already saw them in the image in the registers topic. The flags are divided into three groups:
- Status Flags
- Control Flags
- System Flags
I will only discuss the status flags. If you want to know more, the books listed in the references are worth reading. The status flags signal the result of mathematical operations executed by the processor. They are:
- CF: Carry Flag.
- The carry flag signals a carry or borrow out of the most significant bit in mathematical operations. It means an overflow occurred in unsigned arithmetic, i.e. the result did not fit in the destination.
- PF: Parity Flag.
- Is set when the least significant byte of the result contains an even number of 1 bits.
- AF: Adjust Flag.
- Used in Binary Coded Decimal (BCD) arithmetic; it is set when an operation produces a carry or borrow out of bit 3 of the result. BCD is a nice feature for working with decimal numbers; the references have what you need if you want to know more.
- ZF: Zero Flag.
- Is set when the result of an operation is 0.
- SF: Sign Flag.
- Is set to the most significant bit of the result, so it is 1 when the operation results in a negative number.
- OF: Overflow Flag.
- Is used in signed integer operations and is set when the result is too large a positive number or too small a negative number to fit in the destination.
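To see how the status flags follow from a result, here is a sketch in C that computes them for an 8-bit addition. StatusFlags and add8_flags are hypothetical names of my own; the processor does all of this in hardware, the code only mirrors the definitions listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool cf, pf, af, zf, sf, of;
} StatusFlags;

/* Adds two 8-bit values, stores the 8-bit result in *result and returns
 * the status flags that an 8-bit ADD would produce. */
StatusFlags add8_flags(uint8_t a, uint8_t b, uint8_t *result)
{
    assert(result != NULL);

    unsigned wide = (unsigned)a + (unsigned)b;  /* keeps the 9th bit */
    uint8_t r = (uint8_t)wide;
    StatusFlags f;

    f.cf = wide > 0xFFu;                     /* carry out of bit 7 (unsigned) */
    f.zf = (r == 0);                         /* result is zero */
    f.sf = (r & 0x80u) != 0;                 /* copy of the most significant bit */
    f.af = ((a ^ b ^ r) & 0x10u) != 0;       /* carry out of bit 3 */
    f.of = ((a ^ r) & (b ^ r) & 0x80u) != 0; /* operands share a sign that the
                                                result does not (signed) */
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (r >> i) & 1;
    f.pf = (ones % 2) == 0;                  /* even number of 1 bits */

    *result = r;
    return f;
}
```

For example, 0xFF + 0x01 gives 0x00 with CF, ZF, PF and AF set, while 0x7F + 0x01 gives 0x80 with OF, SF and AF set: the first overflows as an unsigned addition, the second as a signed one.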
Stack
- Save register values
- You can save a register's value on the stack, use the register for another operation, and then restore the old value.
- Store local variables
- Local variables are stored on the stack within the function's scope. As mentioned before, such a variable is accessed directly through the SS segment, using its offset within the stack.
- Passing function parameters
- To call a function you generally push all the parameters onto the stack from right to left and then call the function (the exact order depends on the calling convention). For example, for f(p1,p2,p3) you push p3, p2, p1.
- Store the return address of the function call
- This is the address of the next instruction after the call instruction. With it, the program knows where to go back to when it executes the retn instruction inside the called function.
To store a value on the stack you use the push instruction, and to retrieve it you use the pop instruction.
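A minimal model of push and pop in C could look like the sketch below. ToyStack and its functions are hypothetical names, and the 64-byte size is an arbitrary assumption; the point is only that push first moves ESP down by 4 and then stores, while pop reads and then moves ESP back up.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

typedef struct {
    uint32_t slots[STACK_WORDS];  /* toy stack memory, one 32-bit word per slot */
    uint32_t esp;                 /* byte offset of the current top of the stack */
} ToyStack;

/* An empty stack has ESP just past the highest address. */
void stack_init(ToyStack *s)
{
    s->esp = STACK_WORDS * 4;
}

void push32(ToyStack *s, uint32_t value)
{
    assert(s->esp >= 4);                /* there must be room left */
    s->esp -= 4;                        /* ESP moves to a lower address... */
    s->slots[s->esp / 4] = value;       /* ...and the value is stored there */
}

uint32_t pop32(ToyStack *s)
{
    assert(s->esp < STACK_WORDS * 4);   /* the stack must not be empty */
    uint32_t value = s->slots[s->esp / 4];
    s->esp += 4;                        /* ESP moves back to a higher address */
    return value;
}
```

Pushing 1, 2 and 3 moves ESP from 64 down to 52, and three pops then return 3, 2 and 1, in that order.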
Every time you enter a function a stack frame is set up. A stack frame is the set of addresses reserved for the function being executed; it is delimited by EBP (base pointer) and ESP (stack pointer, the 'top'). Generally this is how you see the stack frame being set up:
PUSH EBP
MOV EBP,ESP
SUB ESP, SIZE
First you save the old EBP, so that when you return to the previous function you can restore its stack frame. Then you set the new base pointer and subtract from ESP the size (how many bytes) that the function needs. The stack is managed as a LIFO (last in, first out) structure, so the last element pushed onto the stack is the first one popped off. But you can also access the data directly if you need to. Remember the SS segment? You can use it together with the ESP register (which points to the top of the stack), like:
mov EAX, DWORD PTR SS:[ESP+4]
That instruction moves the 32-bit value stored at the address ESP + 4 into the general purpose register EAX.
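Putting the pieces together, the sketch below simulates in C what the stack looks like around a call to f(p1,p2,p3). ToyCpu, simulate_call and simulate_prologue are hypothetical names, and the return address and the size of the locals are made-up values; a right-to-left push order is assumed, as described above.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BYTES 64u

typedef struct {
    uint32_t mem[STACK_BYTES / 4];  /* toy stack memory */
    uint32_t esp;                   /* stack pointer (byte offset) */
    uint32_t ebp;                   /* base pointer (byte offset) */
} ToyCpu;

static void push32(ToyCpu *c, uint32_t value)
{
    assert(c->esp >= 4);
    c->esp -= 4;
    c->mem[c->esp / 4] = value;
}

/* Reads the 32-bit value at a stack address, like DWORD PTR SS:[addr]. */
static uint32_t read32(const ToyCpu *c, uint32_t addr)
{
    assert(addr % 4 == 0 && addr < STACK_BYTES);
    return c->mem[addr / 4];
}

/* Caller side: push the parameters from right to left, then the return
 * address (which is what the CALL instruction does). */
void simulate_call(ToyCpu *c, uint32_t p1, uint32_t p2, uint32_t p3,
                   uint32_t return_addr)
{
    c->esp = STACK_BYTES;  /* start with an empty stack */
    c->ebp = 0;
    push32(c, p3);
    push32(c, p2);
    push32(c, p1);
    push32(c, return_addr);
}

/* Callee side: the usual prologue. */
void simulate_prologue(ToyCpu *c, uint32_t locals_size)
{
    assert(locals_size % 4 == 0);
    push32(c, c->ebp);          /* PUSH EBP      */
    c->ebp = c->esp;            /* MOV EBP, ESP  */
    assert(c->esp >= locals_size);
    c->esp -= locals_size;      /* SUB ESP, SIZE */
}
```

Right after simulate_call, [ESP] holds the return address and [ESP+4] holds p1. After simulate_prologue the same two values sit at [EBP+4] and [EBP+8], which is why parameters usually appear at positive offsets from EBP in a disassembly.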
An important thing about the stack is that it grows toward lower addresses on the IA-32 architecture. So the bigger the stack gets, the lower the addresses it uses, as we can see:
How stack works basically
In the image we can see the heap too, and it's our next topic. In this post I am taking a theoretical approach, but in the next ones we will discuss it all in a practical way. For now, let's just abstract.
Heap
char newText[] = "Testing how code storing data.";
Even when data like this is declared inside a function, the compiler generally uses the .data section and gives the instruction that uses the data its constant (immediate) address, something like program.ADDRESS. But when you are loading something external, for example, you don't know the size you are going to need, so the size is dynamic; the program will then have a routine that allocates the size you need on the heap.
The heap is important in RE because many programs use it to allocate data, and sometimes identifying which routine allocates the memory can be useful.
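The difference between fixed data and heap data can be sketched in C like this. copy_to_heap is a hypothetical helper written for illustration; in a real target the allocating routine could be malloc, a C++ new, or an OS API, so treat this as one possible shape only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Size known at compile time: the compiler can place this in a data
 * section and refer to it by a fixed address. */
char newText[] = "Testing how code storing data.";

/* Copies `len` bytes of input whose size is only known at run time into a
 * buffer allocated on the heap. The caller must free() the result.
 * Returns NULL if the allocation fails. */
char *copy_to_heap(const char *input, size_t len)
{
    assert(input != NULL);

    char *buffer = malloc(len + 1);  /* size decided at run time */
    if (buffer == NULL)
        return NULL;

    memcpy(buffer, input, len);
    buffer[len] = '\0';              /* keep it a valid C string */
    return buffer;
}
```

When reversing, a call to the allocation routine followed by a copy into the returned pointer is a common sign that the program is handling data whose size it could not know in advance.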
Now I think you are able to follow the upcoming posts that I will bring here. As I post new topics I will try to go a little deeper into the technical side. In the practical part we will get a better view.
References
- Eldad Eilam - Reversing: Secrets of Reverse Engineering
- Reverse Engineering Code With IDA Pro
- Richard Blum - Professional Assembly Language