Creating Interpreters?

Deleted member 219079 · Jan 16, 2017

I'd like to have some documentation on interpreters, namely on systems where there's compiler and interpreter (that is, compiler takes code written by user and turns it into bytecode and interpreter executes this).

LordDz · Jan 16, 2017

Basicly they read the code while the program is running, opposite of running a compiler which compiles the code.
Java and Python reads the code while it runs, meanwhile C# or C++ compiles it.

But in general, nothing you have to think of when coding.

Interpreter (computing) - Wikipedia

Deleted member 219079 · Jan 16, 2017

Oh, Python interpreter takes user-written input? I don't actually want that.

However, this intrigues me:

Wikipedia said:
3. explicitly execute stored precompiled code made by a compiler which is part of the interpreter system.

So there'd be compiler and interpreter.

GhostWolf · Jan 16, 2017

Someone linked somewhere recently this video of how to write a virtual machine, and I found it very interesting:

This assumes you already have bytecode though, so you should probably find a good video on how to convert an arbitrary language to bytecode (for instance, there are many tools that do this with ASTs, see tools used for syntax highlighting and such - it's the same job).

As to Python, it can take both the high-level language, and compiled bytecode files, which you can generate with the Python tools.
It's the same thing, capability-wise, you just decide when to do the bytecode generation.
For example, when writing code, you really don't want to have that extra compiling step on every tiny change, because then you're back to the annoyance of C++.
When your code is finished, that's when you want to compile it and get it as fast as possible.

If you look at languages like JS, you can see that starting off with the high-level language can be quite damn fast too, but JS virtual machines are ridiculously complex beasts (and rightfully so - it's the most used language in existence).

Deleted member 219079 · Jan 16, 2017

GhostWolf said:
Someone linked somewhere recently this video of how to write a virtual machine, and I found it very interesting:

Wow he explains it in a very simple manner, thanks.

Think that's /thread (that is if I don't get stuck in my own implementation...)

Dr Super Good · Jan 17, 2017

LordDz said:
Basicly they read the code while the program is running, opposite of running a compiler which compiles the code.

A compiler takes code in one language and translates it to code in another language, usually more machine friendly than the original. An interpreter enables the execution of code on hardware lacking native support for that language. To emphasize the difference, a compiler can only translate code where as an interpreter can only run code. Many interpreters make use of a compiler internally for performance.

LordDz said:
Java and Python reads the code while it runs, meanwhile C# or C++ compiles it.

Java and Python pre-compile the human friendly code into more machine friendly java/python bytecode. This bytecode can then be interpreted by the java/python virtual machine. The java/python virtual machine might just in time compile the java/python bytecode into more efficient native machine code which does not need an interpreter to run.

Most C# implementations work the same as java/python from what I can tell. They produce C# bytecode (cli code?) which is then interpreted by the .NET system. In theory C# could use a compiler to translate into machine code however I am not aware of any such implementations and generally they would be counter productive to the aims and goals of the C# language.

C and C++ are languages that are generally compiled with a compiler into a packed machine code format intended for native execution on hardware.

However things are not that simple. There do exist C and C++ code interpreters used to evaluate single statements, such as for command line use. There also do exist Java hardware processors which have java bytecode as an instruction set so have no need of Java interpreters or just in time compilers. There also exist software emulators with an interpreter that can take packed machine code from a compiler and run it. You also get compilers which take one human friendly language and turn it into another, such as vJASS compiling into JASS and the original C++ (not modern versions) compiling into C.

Then you get hardware itself... Modern x86 processors have sort of a hardware interpreter (if you can call it that) for some x86 instructions (which come from x86 machine code). It is configured using microcode and translates the instruction into physical execution logic. The reason this exists is to make the processor mechanics more RISC like as opposed to the CISC nature of x86.

jondrean said:
Oh, Python interpreter takes user-written input? I don't actually want that.

Python is designed to be more flexible than Java. Where as Java still has the traditional program->compile->distribute work flow, Python does not need that and instead you can distribute source code files directly. The language treats both source code and byte code files very similarly to the point that the difference is transparent. Often when a Python program is run a bunch of compiled python bytecode files will appear as the interpreter itself prefers to run python bytecode due to improved performance and easier design.

Do note that the Python interpreter is not the reason the language is so slow compared with Java and C++. Python is slow because of the memory model it uses, where named elements are stored in a hierarchy of mapping objects (dictionaries I recall, been a long time so might not be right). In order to implement this memory model on most computer architectures, every time such a named element is used the interpreter has to resolve it to data in physical memory using mapping data structures such as hashtables which have significant overhead. This is also the reason why JASS is so slow in Warcraft III as the JASS bytecode interpreter does something very similar, although probably less efficiently. Java and C/C++ use a memory model that is more efficient on most architectures as it is generally lower level and more directly compatible.

jondrean said:
So there'd be compiler and interpreter.

It is often needed for the interpreters to have viable realtime performance. It also can allow for a more modular design as the pre-compiler component and language specific bytecode interpreter can be written separately. One can also add support for reading the intermediate bytecode format from a file directly brining better code loading performance as well as smaller distributable size.

A lot of modern interpreters work by taking non-native code and using a compiler to convert it into native machine code in a process called just in time compiling. This is best seen in Java which is how it can achieve performance similar to C/C++ for a lot of tasks. The most practical example would be in modern emulators like Dolphin (gamecube/wii), and the semi-functional PS3, Xbox 360 and Wii U emulators where without JIT compiling getting real time performance from the interpreters is near impossible.

Deleted member 219079 · Jan 17, 2017

This has turned out to be more fun than I expected. I will try to keep Win32 code at minimum so I could port my ghetto VM to Linux as well. I might post it to GitHub someday.

That fellow in the video doesn't use pointers? For example for addition: *--sp += sp[1];.

Dr Super Good said:
A lot of modern interpreters work by taking non-native code and using a compiler to convert it into native machine code in a process called just in time compiling. This is best seen in Java (...)

If only Java was less class centric like Python. Or Python was faster.

Edit: Currently:

Code:

int code[] = {
        OP_ICONST, 2,
        OP_ICONST, 3,
        OP_IMUL,
        OP_PRINT,
        OP_HALT
    };

Getting there

Dr Super Good · Jan 17, 2017

jondrean said:
If only Java was less class centric like Python. Or Python was faster.

Python supports classes as well. Python is an object orientated language. Just unlike Java you are not forced to use objects or types for everything.

GhostWolf · Jan 17, 2017

With all due respect, you are not likely to write anything faster or better than Python, nor is Python slow (and if you want that extra speed, there are different faster branches than CPython, such as PyPy, and actually the same goes for many languages, like Java, Ruby, etc., that have multiple branches developed over time, each with its own strengths and weaknesses).

If you are doing this for speed, I highly suggest you to stop right here.

Making a simple VM seems like a fun project, however, so if that's the reason, good luck.

On a slightly less relevant subject, Python is probably more class oriented than Java. It just doesn't stick it in your face in every line of code you write, which is nice (addendum: I hate Java).

Deleted member 219079 · Jan 17, 2017

I wouldn't use interpreter for speed. I guess just in time compiling is what you're talking about? I don't know about that yet.

Python is slow, I've done some searching today (for example the dictionaries mentioned above). I have other motives behind using it (nice to use with Linux).

Now I'm looking for ways to have the VM run machine code. I guess it's MapViewOfFile under Windows.

GhostWolf · Jan 18, 2017

What is "fast" and what is "slow"? Dictionaries/hashtables/whatever you want to call them, are used EVERYWHERE. Almost every interpreted language is based around them. All of the code running the whole world wide web runs around them. All of your programs use them extensively. Every programming language gives an implementation (or multiple ones), and everyone uses them. Everything in the world of computer science uses them.

But no, DSG wrote some crap yet again, so I guess everyone else in the world is wrong.

For crying out loud, I guess it will never end, it just moves from thread to thread endlessly.

I am out, I had enough of this bullshit. Can't handle this site.

Dr Super Good · Jan 18, 2017

jondrean said:
Now I'm looking for ways to have the VM run machine code. I guess it's MapViewOfFile under Windows.

That can be used for native code style dynamic loading. Virtual memory pages containing code will be dynamically fetched from the file as the code is being read.

Do be aware this places restrictions on the code file format. The code files must be stored in a way that supports memory mapping. The code itself must be ready to use as is, so cannot be compressed. The code cannot make assumptions as to where it will be positioned in virtual memory when executed as applications have limited control over this.

Personally I would recommend avoiding memory mapping for now, unless you want to use it for experience or to try it out. The simplest way to load code from a file would be to allocate a buffer the length of the code and read into the buffer the code file using normal I/O functions. This would also be a lot more portable as memory mapping functionality is almost always platform specific and might not be available at all. All the benefits of memory mapping only really come into play for very large program files.

GhostWolf said:
What is "fast" and what is "slow"? Dictionaries/hashtables/whatever you want to call them, are used EVERYWHERE. Almost every interpreted language is based around them. All of the code running the whole world wide web runs around them. All of your programs use them extensively. Every programming language gives an implementation (or multiple ones), and everyone uses them. Everything in the world of computer science uses them.

They are used everywhere, but generally not for performance critical situations or are used for convenience when they really should not. In the case of typical webpages the page load time is dominated by transmission rate rather than code execution speed and once the page is loaded it still uses trivial system resources. Other times people use them because they are familiar or widely available but other languages would be a better choice.

GhostWolf said:
Every programming language gives an implementation (or multiple ones), and everyone uses them.

However a lot of programming languages result in code which is not based around them and instead is more based around how the hardware works. Most hardware does not support map based structures directly, with them instead having to be implemented using structures like hashtables which work efficiently with addressable memory. The result is that languages with memory models that force map usage (can only perform limited optimizations on removing them) execute at least an order of magnitude slower.

This does not matter for high level code. However if writing low level code the performance difference can be huge. For example it is completely fine to use Python for general video processing however the functions which manipulate video image data, such as encoders, should be implemented in more native code otherwise it could easily take 10 or more times longer to run.

GhostWolf said:
Everything in the world of computer science uses them.

Computer science covers a broad range of subjects. Some such as HCI design, project management or computer ethics have little to do with mapping data structures. Others which focus on hardware are below the level where most mapping data structures are implemented.

Deleted member 219079 · Jan 18, 2017

I want to let the user use the real stack, I assume it goes like this? (I barely know what ASM means so...)

Code:

    case OP_ICONST:
        __asm {
            sub esp, 4                  ; get stack
            mov eax, dword ptr [edx]    ; get const int
            mov dword ptr [esp], eax    ; set the const int
            add edx, 4                  ; increment ip
            jmp ite                     ; go to ite
        }

My function is like so: static int __fastcall _run(int * ip);
So it puts ip to edx.

Dr Super Good · Jan 18, 2017

Allocate your own stack in memory? That is basically what most stacks are (exception being some micro processors).

Deleted member 219079 · Jan 18, 2017

Well, can I make esp point to it then and use PUSH / POP?

Creating stack whose size user can modify sounds cool yea.

Edit: Using ASM and C together is quite cumbersome, I will get rid of the C switch structure for the starters..

Dr Super Good · Jan 18, 2017

jondrean said:
Edit: Using ASM and C together is quite cumbersome, I will get rid of the C switch structure for the starters..

Why are you using assembly? The interpreter can be written in C/C++. If you make a JIT compiler then you output machine code directly (like a C/C++ compiler) which you jump execution to.

Deleted member 219079 · Jan 18, 2017

I can see from VS's disassembly tab that the C code does god-knows-what. I want to know what the loop is doing but the compiler hides it from me. So I will do the loop in ASM.

This is a good opportunity to learn assembly anyway! I'm downloading Ubuntu with which I will take my time to learn assembly.

Dr Super Good · Jan 19, 2017

jondrean said:
I can see from VS's disassembly tab that the C code does god-knows-what. I want to know what the loop is doing but the compiler hides it from me. So I will do the loop in ASM.

The compiler probably knows how to produce better machine code than you do. That said why do you even need to know what machine code it produces? Optimizing the interpreter should be one of the last stages, and in most cases will be done by adding functionality like JIT compiling rather than optimizing the execution of individual instructions.

jondrean said:
This is a good opportunity to learn assembly anyway! I'm downloading Ubuntu with which I will take my time to learn assembly.

Do you really want to learn assembly? It is highly logical but so low level its hard to understand what is going on. Hence why practically nothing is written in assembly anymore, even code for micro processors is all C or even C++.

Deleted member 219079 · Jan 19, 2017

It's a matter of *knowing* what on earth is going on rather than competing in performance with the compiler.

I want degree in computer science, learning assembly seems appropriate. Besides, wouldn't I need to know it to implement JIT anyway?

Edit: Well I've been trying to install several Linux distros for 5 hours on my desktop PC and gave up. Maybe some year I get hang of it.

Deleted member 219079 · Jan 21, 2017

For future reference if someone googles this or something, the arbitary code execution could just be done by dynamic library linking. (Enbuffered machine code DSG suggested would require knowing the calling convention.) Example from MSDN:

Code:

typedef void (WINAPI *PGNSI)(LPSYSTEM_INFO);

// Call GetNativeSystemInfo if supported or GetSystemInfo otherwise.

  PGNSI pGNSI;
   SYSTEM_INFO si;

   ZeroMemory(&si, sizeof(SYSTEM_INFO));
 
   pGNSI = (PGNSI) GetProcAddress(
      GetModuleHandle(TEXT("kernel32.dll")),
      "GetNativeSystemInfo");
   if(NULL != pGNSI)
   {
      pGNSI(&si);
   }
   else
   {
       GetSystemInfo(&si);
   }

You need to use LoadLibrary prior to using GetProcAddress.

So this project turned out way shorter than I thought lol.

Such an useful experience.

Dr Super Good · Jan 21, 2017

To call DLLs you need to know the calling convention still? Although I think all DLLs use the same convention...

The only thing stopping you from simply calling a function pointer to a buffer in the heap is that unless the memory pointed to is in a virtual memory page marked as "execute" it will throw a segmentation fault of sorts. This security feature of virtual memory has heavily hindered arbitrary code execution exploits. However one can use OS specific functions to allocate an address rage of execute pages, write data to them and then call or jump to an address inside that range like normal.

Trivia, Age of Empires 2 was written before hardware accelerated graphics became mandatory in computer systems. In order to obtain their iconic and cutting edge graphics they required terrain blending. In order to do this efficiently they created a terrain blending bytecode format which was then compiled (JIT like) into hard-coded blending machine code which could be run with unprecedented performance and produce terrain blending results far more efficiently on hardware of the day than the generic all purpose hardware accelerated real time blend algorithms we use today. These mechanics remained a mystery to modders for over 15 years until a former developer documented that it worked by taking a terrain blend image, pre-compiling (for space and load time) into a bytecode like format that is then at run time converted into machine code to perform the blends. This has to be one of the most obscure programming like languages ever made as it used an image as source code and was not publically documented!

LordDz · Jan 22, 2017

If you think you need to learn assembly to make your programs go fast, you're doing something wrong in your code.

Dr Super Good · Jan 22, 2017

LordDz said:
If you think you need to learn assembly to make your programs go fast, you're doing something wrong in your code.

He already said he was only doing it to learn assembly.

One needs to understand assembly to some extent as assembly is generally the lowest human readable programming language one can get. Sure one could use a hex editor to look at byte code but that is not really human readable. In most cases 1 instruction translates to 1 line of assembly, hence it is a good way to understand how instructions work etc. Although useless for most programming tasks, know how instructions work is critical to compiler development.

Geries · Jan 23, 2017

you can allocate memory for cpu instructions with this:

C++:

SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
const _PAGE_SIZE = sysinfo.dwPageSize;

size = size + _PAGE_SIZE - 1;
size = size - size % _PAGE_SIZE;

void* memory = VirtualAlloc(nullptr, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);

and execute it with inline assembly jump, or casting void* to a function and executing that(pay attention to call convention), whatever that you feel like. Ofc this is if you just have raw cpu instructions which you want to execute.

Creating Interpreters?

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Similar threads