
Advantages of regular languages for learning

Status
Not open for further replies.
Level 15 · Joined Aug 7, 2013 · Messages 1,337
Hi,

Does anyone have an idea or set of references for cognitive properties of regular languages?

I've got a problem here. Suppose you are given a finite number of strings from an infinite language, and after X amount of time, you are supposed to have recovered that language completely (i.e. you can build the appropriate FSA). Furthermore, you are never told directly what strings are not in the language, as your input is only those strings in the language.

Now, is this problem solvable at all--can you come up with a learner which eventually converges on the infinite language given a finite amount of input and a finite amount of time? Or do you need some a priori assumptions about the language, since there are infinitely many FSAs which accept the input strings, yet possibly only a handful actually capture the generalizations of the target language? Remember, your final FSA cannot accept strings which are not in the language (otherwise you have failed), but it must also accept strings which are in the language that you never received as input (since the language is infinite, you could not have seen every string).

Edit: If you believe the problem is not solvable without some clever inputs or a priori assumptions by the learner, please give a brief proof. Here's mine.
Since we can never sample every string in the language, it is possible there is a string containing symbols we have never seen. No FSA we build could accept it, since those symbols were never added to our alphabet. Even if we did have all the symbols, the input could have been very unlucky and never included samples of a certain pattern. If we were given a uniform sampling of all the strings, would these two issues still arise?
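The first worry (unseen symbols) can be made concrete with a small sketch. The sample set and the `plausible` helper are made up purely for illustration: whatever acceptor we induce, its transitions are only defined over symbols we actually observed.

```python
# Sketch of the "unseen symbol" argument: any automaton induced from
# positive samples only knows the alphabet observed in those samples,
# so a target-language string with a fresh symbol is always rejected.

samples = {"ab", "ba", "abba"}          # illustrative positive data
alphabet = set("".join(samples))        # everything we've ever seen

def plausible(s):
    # Stand-in for the learned acceptor; its details don't matter here.
    return s in samples

def any_learned_fsa_accepts(s):
    # A symbol outside the observed alphabet has no transition defined,
    # so no learned FSA can accept a string containing it.
    return all(ch in alphabet for ch in s) and plausible(s)

print(sorted(alphabet))                  # ['a', 'b']
print(any_learned_fsa_accepts("abc"))    # False: 'c' was never observed
```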

If we're assuming the Chomsky hierarchy of languages, would it make sense for a learner to look first for regular patterns, and to move to more powerful classes (context-free, etc.) only when evidence from the input suggests non-regular patterns?

What does it mean for a language to be regular in terms of cognition, as compared to, say, a context-free or context-sensitive language? When would it be ideal to assume the input language is actually context-free or context-sensitive, and what kind of evidence would you need?
 

peq

Level 6 · Joined May 13, 2007 · Messages 171
Your intuition that the problem is not solvable is correct, but the arguments in your proof are a little fuzzy.

I would just explain it with a counterexample. Suppose you are given the samples {"a","aa"}. Then {"a","aa"} and {"a","aa","aaa"} are both languages containing all the samples, but the two are different languages.
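The counterexample can be checked mechanically; here is a tiny sketch where plain sets stand in for the two languages:

```python
# Two different languages, both consistent with the same positive samples.
samples = {"a", "aa"}
L1 = {"a", "aa"}
L2 = {"a", "aa", "aaa"}

# Both languages contain every sample...
assert samples <= L1 and samples <= L2

# ...yet they are not the same language, so the samples alone
# cannot determine which one is the target.
print(L1 != L2)  # True
```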

However, it is possible to find approximations of languages using samples. One relevant blog post I have seen recently: http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb
If you search a little you will probably find some papers, for example this one: http://www.informatik.uni-trier.de/~fernau/papers/Fer05c.pdf
 
Level 6 · Joined Jul 30, 2013 · Messages 282
Regular...
Well, there is such a thing as regex, a pattern-matching language that is (you guessed it) regular.

Latest extensions aside, it should give you a feel for what a regular language is.
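To illustrate the boundary, here is a short example with Python's `re` module (the patterns are chosen purely for illustration): the first pattern is a genuinely regular language, while the second uses a backreference, one of the extensions that goes beyond what a finite automaton can recognize.

```python
import re

# (ab)* is a textbook regular language; a finite automaton handles it.
regular = re.compile(r"(ab)*")
print(bool(regular.fullmatch("ababab")))  # True
print(bool(regular.fullmatch("abba")))    # False

# Backreferences go beyond regular: (\w+)\1 matches "squares" like
# "abcabc", which no finite automaton can recognize in general.
beyond = re.compile(r"(\w+)\1")
print(bool(beyond.fullmatch("abcabc")))   # True
print(bool(beyond.fullmatch("abcabd")))   # False
```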
 
Level 15 · Joined Aug 7, 2013 · Messages 1,337
Update: So it appears learnability is a property of language classes--there can be a class of regular languages which is not learnable, but a class of context-sensitive ones which is. It seems to depend on the "semantics" of the language class...



I'm aware of what regexes in modern programming languages are (and some technically aren't regular expressions in the formal sense, i.e. they can recognize more expressive languages than the regular ones).

I am asking what is implied by a language being regular versus context-free. Context-free languages, for example, are more expressive, but how? What kinds of structures are they capable of recognizing (or generating) that require them to be context-free?
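One classic answer is unbounded nesting: a context-free language can require matching arbitrarily deep structure, and any DFA has only finitely many states, so it cannot track arbitrary depth. A minimal sketch using balanced parentheses (restricting the alphabet to `(` and `)` is an assumption for simplicity):

```python
def balanced(s):
    """Recognize the context-free language of balanced parentheses.

    The running depth counter is unbounded, which is exactly what a
    finite automaton lacks: with finitely many states it cannot
    remember arbitrarily deep nesting.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # closing with nothing open
                return False
        else:
            return False         # alphabet restricted to ( and )
    return depth == 0

print(balanced("(()(()))"))  # True
print(balanced("())("))      # False
```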

If I have a set of strings that are members of some language L, what might clue me in to say, "Hey, perhaps these strings were not generated by a regular grammar, but a context-free one?" If we have an oracle (which always tells us whether a string is or isn't in the language), this becomes much easier: apply each pumping lemma, and if the language fails the pumping property for a class, we know it is not in that class.

But assuming we don't have an oracle and only some finite (but possibly very large) set of member strings, what do we look for? It seems extra challenging because any finite language is a regular language, and for every finite language there exists a (unique minimal) DFA that recognizes exactly it and nothing else.
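The "exactly the samples" hypothesis corresponds to what grammar-induction literature calls a prefix-tree acceptor. A minimal sketch, assuming a dict-based trie with a made-up `"$accept"` marker (real induction algorithms such as RPNI start from this structure and then merge states to generalize):

```python
# Build the most conservative DFA-like hypothesis: a prefix tree
# that accepts exactly the finite sample set and nothing else.

def build_pta(samples):
    trie = {}
    for s in samples:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node["$accept"] = True   # mark end of a sample as accepting
    return trie

def accepts(trie, s):
    node = trie
    for ch in s:
        if ch not in node:       # no transition: reject
            return False
        node = node[ch]
    return node.get("$accept", False)

pta = build_pta({"ab", "aabb", "aaabbb"})
print(accepts(pta, "aabb"))      # True  (seen)
print(accepts(pta, "aaaabbbb"))  # False (unseen, so rejected)
```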

For example, I could notice this set of strings:

ab
aabb
aaabbb
aaaabbbb

Now, what do I do? Do I generalize and say, "Ah, this is the classic context-free language a^n b^n"? Or am I super restrictive and just say, "Oh, it's the language {ab, aabb, aaabbb, aaaabbbb}"? Under the former hypothesis I've basically claimed any a^n b^n is a member of L, but under the latter I would not predict that, say, aaaaabbbbb is in L, since I did not see it.
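The two competing hypotheses can be written down directly (function names are illustrative); both fit the data perfectly, yet they disagree on the very next string in the pattern:

```python
samples = ["ab", "aabb", "aaabbb", "aaaabbbb"]

def restrictive(s):
    # "The language is exactly the strings I saw."
    return s in samples

def generalized(s):
    # "The language is the classic context-free a^n b^n, n >= 1."
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "a" * n + "b" * n

# Both hypotheses accept every sample...
assert all(restrictive(s) and generalized(s) for s in samples)

# ...but only the generalized one predicts the unseen next string.
print(restrictive("aaaaabbbbb"), generalized("aaaaabbbbb"))  # False True
```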

Thankfully it probably isn't a very helpful problem to solve anyway (and is probably impossible with 100% certainty without an oracle), but still interesting.

This question is sort of general, but the paper referenced was sort of in the right direction.
 