Sunday, June 4, 2017

Reverse Engineering - A Didactic Point of View

Hello folks!

Well, I'm starting this blog to produce content for the community and as a way to study. Sorry in advance for any English mistakes :). Since I'm just getting started, I thought I'd begin with an introductory text and increase the difficulty as I write new ones; I also want to bring in my parallel studies along the way. My goal is to generate information for everyone, trying to keep the content objective and didactic. I'll try to translate all my posts from Portuguese to English. My focus is Portuguese, because our community lacks content in that language. I hope you enjoy the content generated here.


If you have malware samples and wish to share them with me before trashing them, please send them to my e-mail pimptechh@gmail.com. I would be grateful. My intention is to bring the analyses of those samples here.


Reverse engineering is a complicated and polemic subject, because it involves many legal questions, and for this reason it is a theme that must be treated carefully. I will be using the concept as applied to software, although it is a very embracing idea and could refer to any engineered artifact. Basically, it is the "deconstruction" of something that has already been built. Some people use this knowledge to break software, and are generally known as "crackers", using this power to create a certain chaos in the system. Even so, it is a very important process in industry nowadays: software security, software maintenance, software performance and so on. It's a view of how something works and how we can make it better.

In reverse engineering it is fundamental to have curiosity, since that is basically where the idea of reversing things arises: "Well, how was this made?" or "Well, why is this happening?". A simple debugging session, even with the source code in hand, fits this process: trying to find some value or bug in the system. Therefore, to understand a reverse process we must first have some basis on how the process or artifact was built in the first place. If there is a need to reverse a piece of software, it's necessary to know at least how to build one. Understanding how software works is indispensable; before just throwing it into the reversing tools we must know the basics. The fundamentals are always important, and I will try my best to show their importance through my studies.

Where did everything begin? Well, I don't really know (still finding out), but we do know the "first" big invention resembling the computer models we have today: the ENIAC. This project already had a little collaboration from von Neumann (creator of the computer model we still use), helping to solve some mathematical logic problems, and right after ENIAC they started its successor, the EDVAC. ENIAC basically did complex arithmetic calculations, although it didn't have the concept of memory and stored programs. In the EDVAC this concept started to be explored.


The modern computer "started" with John Von Neumann, a exceptional mind and memory. He could recite whole book chapters using exactly the same 
writed words without changing anything. Even reciting and translating simultaneously from German to English. In 1927 and 1928 published some papers about mathematical fundamentals of quantum theory and probability in quantum statistics, so if you don't believe in quantum mechanics or even doubt it, turn of you computer and throw it away haha. In these papers he demonstrated physicists phenomenal knowledge. The abstraction came from this extremely complex problems in which Von Neumann as was an expert. Clearly he was very ahead of his time. The history it's very interesting, worth read about it. There is a book "The computer and the brain" in which he tries to explain the guesses and concepts for the model designed by him. In this book he tries to do some analogy of our nervous system in a mathematical point-of-view using concepts of statistics and logic. Human body, biological machine.

When we think about the magic a computer performs, it's amazing. It stores energy and at the same time stores information, and with that information it solves problems that are fruits of intellect. The computer was born from the necessity of solving problems much faster, making complex calculations at unbelievable speed. Basically you insert an input, this value is stored in memory as energy/voltage, and this voltage is differentiated between 0 and 1. All of this is stored within your computer and, assisted by boolean algebra, generates our output, whatever it is; it all depends on the problem you want to solve. When I say "problems", some abstraction of what you consider a problem is necessary.

I don't know what the conversations between von Neumann and his mathematician friends used to be like, but they were crazy for sure :P.

We solve a lot of problems every single day by the simple fact that we have energy stored in our computers. One of the programming concepts is right here: abstraction. The higher the layer you are on, the more you abstract away the computation. When you need to talk to someone, all you do is choose your IM, send a message and solve the problem (talking to someone); that's just an example. Not realizing this makes the paradigm hard to break. Take a calculation like (2+2): you look at the operands, recognize a number 2, store this information, read one more number 2 and then apply the logical addition (+) operation. You just read the information, store it and operate on it logically, exactly like a computer does. But how does all of this turn into information? By raising the representation layers. We have 0 and 1 stored as energy in our computer, and these numbers can represent many things, like on and off, full and empty and so on. When we put blocks of 0's and 1's together we have a lot of possible ways to represent information, like numbers and letters.
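For example, the same 8-bit block can be read as a number or as a letter, depending only on how you interpret it. A minimal C illustration (assuming ASCII):

    #include <stdio.h>

    int main(void) {
        char c = 0x41;            /* the bit pattern 01000001 */
        printf("%d %c\n", c, c);  /* prints "65 A": same bits, two interpretations */
        return 0;
    }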


All the information is laid out as 0's and 1's, and putting them together forms blocks of information. It all depends on the way you want to interpret that information. Since there are 2 possible values, we can say this is a base-2 representation. I'm not a super math fan, but I can see the beauty. Anyway, how can we abstract information from a base-2 number like that? Let's start with the decimal base 10. Everybody, I believe, learned in school about magnitudes and the ways to represent them, like 10, 100, 1000 and so on. You implicitly read these representations as ten, one hundred, one thousand, and you can do that because of the additional 0 at the end of each magnitude. When we put the decimal digits together we are able to count (0,1,2,3... 10,11,12...); at each magnitude break we get a new representation.


It is the same with binary, but in binary we change the 0's and 1's positions inside an information block. Here is an example of binary counting:
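Counting from 0 to 8 with 4-bit blocks:

    0000 = 0
    0001 = 1
    0010 = 2
    0011 = 3
    0100 = 4
    0101 = 5
    0110 = 6
    0111 = 7
    1000 = 8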



We can represent things in this block form, or we can raise the representation to decimal numbers (friendlier to us), converting base 2 to base 10 as follows:
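Each position carries a power of 2, so, for instance:

    1011 = 1*2^3 + 0*2^2 + 1*2^1 + 1*2^0
         = 8 + 0 + 2 + 1
         = 11 in decimal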



Now, how is it possible to make operations with this primitive way of representation? With boolean algebra. Inside the CPU there are a lot of microcircuits of complex logic in which our information is treated. One of these microcircuits is the Half-Adder, and I believe it is one of the smallest:




Two voltage inputs (A, B) run into this circuit; "A" can be 0 or 1, and the same goes for the "B" input. Now, how does this circuit really work? Inside the microcircuit there are logic gates like OR, Exclusive OR (XOR), AND, NOT and so on. This circuit has two outputs, sum and carry. With this we can add two bits just like we do with decimal numbers. Take the example 5+5=10: the sum digit is 0 and the carry is 1. The same goes for bits; e.g., with our Half-Adder we can "sum" the two bits 1+1.
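The gate logic is easy to sketch with C's bitwise operators (a minimal illustration of the truth table, not how the hardware is actually wired):

    #include <stdio.h>

    int main(void) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                int sum   = a ^ b;  /* XOR gate: 1 when exactly one input is 1 */
                int carry = a & b;  /* AND gate: 1 only when both inputs are 1 */
                printf("A=%d B=%d -> sum=%d carry=%d\n", a, b, sum, carry);
            }
        }
        return 0;
    }

For 1+1 it prints sum=0 carry=1, which is exactly the binary "10".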




Doing that, we run into another problem, because we know that the number 2 in binary is "10", right? The Half-Adder did its job; now we have to take care of that carry output, and for this reason we have the Full-Adder, which can receive three inputs: A, B and the additional carry. Do you see the problem abstraction? Inside the Full-Adder we have the Half-Adder. Our CPU has a lot of extremely complex abstraction layers, microcircuits within microcircuits, and it takes a lot of hours of study to really know all of it. However, we will not design microcircuits; we just want to know what our computer is made of and how it works, to know the fundamentals. If you desire to learn more, I'd recommend reading about computational logic. There are many microcircuits embedded in each other, and the CPU is an arrangement of extremely complex circuits.
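Sticking with the same kind of C sketch, a Full-Adder is just two Half-Adders plus an OR gate joining their carries (the names here are only illustrative):

    /* Full adder: adds bits a and b plus an incoming carry. */
    int full_adder(int a, int b, int carry_in, int *carry_out) {
        int s1  = a ^ b;           /* first half adder: sum    */
        int c1  = a & b;           /* first half adder: carry  */
        int sum = s1 ^ carry_in;   /* second half adder: sum   */
        int c2  = s1 & carry_in;   /* second half adder: carry */
        *carry_out = c1 | c2;      /* either stage may carry   */
        return sum;
    }

Chain eight of these and you can add two 8-bit numbers; that chain is called a ripple-carry adder.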




Right, now we have our 0's and 1's represented as information in blocks and treated by the logic operators, making our machine generate our output. Even when converting our information to decimal, and later to hexadecimal (we will see that soon), we have to understand that the computer was born to solve complex mathematical operations, and to do so it needs to understand "complex" numbers. I'm not a mathematician, so sorry for any mistakes. We have a few topics to cover before getting into hexadecimal, which is the main way to work with data in our field. The machine needs to interpret negative numbers, and for this we have signed number representations (a good topic for a later post: demystifying signed and unsigned in C). In this system we use the most significant bit (msb) to represent the sign of the number, (0)0000001, where this decimal number is "1", positive. 0 stands for positive and 1 for negative. Using the first bit as the sign means we are no longer able to represent the decimals 0 to 255; with that one bit less we can represent only -127 to +127, because the msb was cut from our 8-bit block. We have a solution and yet we don't, right? Because how will we operate on this logically? Let's try the example 2 + (-3):
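Adding the raw sign-magnitude patterns naively, in 8-bit blocks:

      00000010   (+2)
    + 10000011   (-3 in sign-magnitude)
    ----------
      10000101   which reads as -5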




As we can see, that isn't the desired result (2 + (-3) should be -1), right? To solve this we have the "complement". The first version is the "one's complement": basically we flip all the bits in the block to their opposites, e.g. "00000010" (+2) becomes "11111101" (-2). But this only solves the problem partially, since the result still isn't the correct value, as we can see in the next example:
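For instance, 4 + (-2) in one's complement:

      00000100   (+4)
    + 11111101   (-2 in one's complement)
    ----------
     (1)00000001  -> discarding the carry gives 1, but 4 - 2 should be 2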




Just to remember: the carry from the previous operation is carried out of the 8-bit result. The third position from the right generates a carry that ripples left until it is carried out of the block. As we can see, the result is 1 less than the correct value; it should be "00000010" but instead we get "00000001". How could we fix this? If we add 1 to this result, it solves our problem. The "two's complement" came to help us by doing exactly that. Let's see the next example:
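In two's complement we flip the bits and then add 1, so -2 becomes 11111101 + 1 = 11111110. Redoing the same sum:

      00000100   (+4)
    + 11111110   (-2 in two's complement)
    ----------
     (1)00000010  -> drop the carry out and the result is 00000010 = 2, correct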



The "Two's complement" came to solve the "One's complement" problem. So let's suppose the same type of operation, but this time we want to do it with a negative number been bigger than the positive. Let's see 4 + (-5):




Well, as we can see in the previous example, the value can't be read directly when the negative number is bigger. When this happens it's necessary to take the result and apply the complement to it again: flip the 0's to 1's and the 1's to 0's, then add 1, and use the sign to label the new number. If we do that, our result reads as "10000001" in sign-magnitude, i.e. -1, which is correct for 4 + (-5). The CPU uses flags for this type of operation, in this case the Sign Flag (SF), but let's talk about that in another text.


Let's proceed to base-10 conversion, which is the number notation we use on a daily basis. We will also see base 16, which is particularly important since we use it a lot in reverse engineering. But before we do that, it's important to know the bit-block sizes, or labels, which are heavily used in RE too. Each bit-block has a size or "name" that tells how long the information inside it is. Let's see in the next example:



4 bits = 1 Nibble
8 bits = 1 Byte 
16 bits = WORD 
32 bits = DWORD
64 bits = QWORD
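In C these sizes map onto the fixed-width integer types from <stdint.h> (the WORD/DWORD/QWORD names are Windows conventions; the variable names below are just illustrative):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t  a_byte  = 0xAB;                   /* 8 bits  = 1 Byte    */
        uint16_t a_word  = 0xABCD;                 /* 16 bits = WORD      */
        uint32_t a_dword = 0xABCD1234;             /* 32 bits = DWORD     */
        uint64_t a_qword = 0x0123456789ABCDEFULL;  /* 64 bits = QWORD     */
        uint8_t  nibble  = a_byte & 0x0F;          /* low 4 bits = Nibble */
        printf("%X %X %X %llX %X\n", a_byte, a_word, a_dword,
               (unsigned long long)a_qword, nibble);
        return 0;
    }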


Now, with hexadecimal we can encapsulate a larger quantity of bits in a single digit. Let's take the block "1111" as an example; in decimal its value is "15". Converting this value to hexadecimal we get "F", that's it, one hexadecimal digit, and with two digits we can represent any 8-bit value. Let's see an example of how to count in hexadecimal:
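Counting up, the letters A to F stand in for 10 to 15 before the next position rolls over:

    decimal: 0 1 2 ... 9 10 11 12 13 14 15 16 17 ... 255
    hex:     0 1 2 ... 9  A  B  C  D  E  F 10 11 ...  FF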



There are many sites to convert hexadecimal values; some examples:


  • http://string-functions.com/hex-string.aspx
  • http://string-functions.com/hex-decimal.aspx
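You can also do these conversions with a couple of lines of C (printf's %X and strtol are standard library calls, nothing exotic):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        printf("255 in hex is %X\n", 255);        /* prints FF */
        long value = strtol("FF", NULL, 16);      /* parse a base-16 string */
        printf("FF in decimal is %ld\n", value);  /* prints 255 */
        return 0;
    }
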
To do all this we need to store all that information so we can work with it. Nowadays the way to do this is RAM. RAM stands for Random Access Memory, and it consists of cells that store charge/voltage, or information, call it as you want. This type of memory is volatile, i.e. when the power is cut, all of its contents are lost. It also means the hardware has to refresh these cells constantly to maintain the charge already stored there, avoiding leakage.

So, now we have the energy stored and the CPU working logically with information, but how does this information flow through the computer? Well, for this job we have the data bus. When we think about the data bus, it's necessary to abstract the term to make it easy to understand: just imagine a tunnel where all the 0's and 1's pass freely. But how do we manage the information passing through this channel? I mean, how do we address the origin and destination of the data? For this we use the multiplexer and the demultiplexer. These are circuits that manage where data comes from and where it goes; they pick the required information out of the tunnel. It's like encoding and decoding the information. It's a complex topic, so if you are interested I recommend reading about it; the references listed below have some good information.
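As a rough software analogy (purely illustrative, not a real hardware description), a 2-to-1 multiplexer just uses a select line to decide which input reaches the output:

    /* 2-to-1 multiplexer: sel chooses which input is routed to the output. */
    int mux2(int a, int b, int sel) {
        int mask = -sel;                  /* sel=1 -> all ones, sel=0 -> all zeros */
        return (b & mask) | (a & ~mask);  /* picks b when sel=1, a when sel=0      */
    }

Note how the mask trick itself relies on two's complement: -1 is the all-ones bit pattern we just studied.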

So far we know how the computer interacts and how it operates on all the information stored in it. However, I would like to remind you that everything I'm writing here is very basic; our computer is much more complex than that. Still, the fundamentals are always the same: information is the same, bits are the same, how information is stored remains the same. I see this text as a way to think about computation. So everything is possible! We just have to visualize it, knowing how it works. OK, but how do we interact with this incredible machine? The answer is the OS (Operating System); it manages everything for us, all the memory, all the low-level parts of the machine. Everything passes through the OS; every task uses the OS to do something. In reverse engineering we have to keep this in mind: you want to figure out how something works? First figure out how it interacts with the OS, and start from there. The OS encapsulates all the methods to manage your machine in different layers of abstraction. Want to know how it works? Look for OS internals books. Worth the reading time, I guarantee.

Well, now we know how the machine operates and what helps us operate it. In the user layer of the OS we are already at the top of the abstraction layers, where everything is prettier and visually easier. The same goes for programming languages, and knowing how to program is essential. Many people may say that you don't really need to know how to program, but I assure you that everything is much easier when you do. Our entire computer does what we say because of it. The only language the computer knows is machine language, with all its operators and operands. Any language that you use turns into machine language in the end. Nowadays we have what we call high-level languages; they are called that because their commands are close to our formal language in "real" life.

There are many types of high-level languages, but two of the most used are Java and C#; they are heavily used on a daily basis. They let us solve problems much faster, but that doesn't imply performance, because of encapsulation and abstraction. The higher you are in the abstraction layers, the more technology you have under you, which means many things you can't control. In general these languages run within a virtual machine, so the OS doesn't interact directly with your program, but with their virtual machines instead. Let's see an example of a language's life cycle:


High-Level => Compiler => Assembly => Machine Language

As we can see, this cycle stands for languages that are compiled directly. Java and C# don't run directly; instead they use virtual machines, as I already said. The cycle I've shown is for languages like C/C++ that are compiled and run directly on the OS, with no virtual machine between them. At the low-level layer we have the assembly language, which in fact is the machine code translated into mnemonics; assembly instructions are basically the mnemonics of the real machine language. Java has the JVM (Java Virtual Machine) and C# has the CLR (Common Language Runtime); both are virtual machines that interact with the OS. These languages don't generate machine code; instead they generate bytecode, which can only be interpreted by their virtual machines. So reversing them is a little different, easier I would say, because bytecode is a kind of machine code but higher-level and more readable.
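You can watch this cycle happen with any C compiler. Assuming gcc and a file named hello.c (just an example setup):

    /* hello.c */
    #include <stdio.h>

    int main(void) {
        printf("Hello, RE!\n");
        return 0;
    }

Then, on the command line:

    gcc -S hello.c        # stop after compiling: hello.s contains the assembly
    gcc -c hello.c        # assemble it: hello.o contains machine code
    gcc hello.c -o hello  # the full pipeline: a linked executable
    objdump -d hello      # disassemble the machine code back into mnemonics

That last command, objdump, is already a first taste of reverse engineering: it recovers the assembly view from the finished binary.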

In reverse engineering we must be conscious of the fundamentals, because we work directly with machine code/assembly. This language works directly with the machine, with memory; it deals directly with the CPU and its architecture, like any high-level language but with slightly different concepts. When you change languages you have to learn how to interact with the new one, and it's the same with assembly and a CPU architecture, but in general the operations are the same: "any" language has loops, variables, constants and so on, and the same goes for the CPU and its architecture. So I stress that it's important to know how to program, and if you want to go down this path, you have to learn C/C++. If my opinion is worth something, go for C :). When you understand the low-level world well and know how to interact with it in C, then maybe you migrate to C++, or don't hehe. Anyway, the OS is the key to reversing anything that interacts with it. Look at the layers below and try to abstract your machine after reading all this text:


Operating System | Third Layer
|||
Instruction Set Architecture | Second Layer
|||
Microarchitecture (Hardware) | First Layer

Conclusion

Well, this is my first text about the introductory part of reverse engineering; I hope it can help someone. It's a magical world, in fact a very wonderful world! The question is: how far down the rabbit hole do you want to go? I want to keep this blog always updated, trying to help the community grow as I do. I will try to be as objective and didactic as possible to help everyone understand the concepts. In the next texts I wish to bring more technical stuff, like malware analysis, solving puzzles, programming in assembly/C and so on. Feel free to ask and criticize. My only goal is to learn and help the community. Let's share, let's grow together.

Regards from Pimptechh, and happy studying.

References:
  • Google
  • Andrew S. Tanenbaum - Structured Computer Organization
  • Eldad Eilam - Reversing: Secrets of Reverse Engineering
  • Richard Blum - Professional Assembly Language (2005)
