Theorem 1: When f(x)=x+1, and g(x,y)=x+y, all k are S-numbers.
Proof: If k < b (base) then the sum of the digits is k, and k is an S-number (pace any definitional issues as to whether numbers < b are eligible for consideration). If k ³ b, the sum of the digits is less than k, and k may be reached by repeatedly adding 1.
Theorem 2: When f(x)=x+n, and g(x,y)=x+y, k is an S-number iff when the value of k modulo n is the same as the sum of its digits modulo n, and k > n
Proof: The sum of the elements in the sequence, modulo n, denoted j, is conserved by the substitutions x®f(x) and x,y®g(x,y). Hence if k modulo n is not j, then it cannot be an S-number. Therefore if k is an S-number, k modulo n=j.
Hence if k is an S-number, it can be written as j+mn. Let i be the sum of the digits. Then j=i-hn, and k=i+(m-h)n. If m-h ³ 0, then k can be reached by applying f(x) to the sum of the digits. As k > i, m-h > 0, and k is an S-number.
If k £ n, then i < k < i+n, and k cannot be an S-number.
Note: When f(x)=x+n, and g(x,y)=x+y whether k is an S-number can be tested using this theorem instead of using the general purpose algorithm, and the expression which gives k can be written out directly.
Theorem 3: When f(x)=x+n, and g(x,y)=x+y the set of S-numbers, in any base, is infinite.
Proof: Follows from theorems 1 and 2.
Whether a particular number is a S-number, in base b, for functions f(x) and g(x,y) can be ascertained by dividing it into digits, and recursively applying the substitutions x®f(x) and x,y®g(x,y) until the value of the expression exceeds the number, whereupon we can conclude from f(x) ³ x and g(x,y)=x+y, that further substitutions must exceed the number. If the value of the expression equals the number we have found a solution, and we can either abandon the search of the implicit tree, or continue on other branches to find other solutions. (For some S-numbers there is more than one solution.)
If f(x)=x we also must terminate the recursion at this point, as otherwise we would have an infinite regress. With the functions I have investigated this only occurs for x=0,1,2.
To avoid the risk of arithmetic overflow, one should precalculate the largest number v, f(v) less than target, and check against this for terminating the recursion, rather than applying f(v) and then comparing with the target. (This is also marginally faster.)
The practical limit of this algorithm depends on how fast f(x) grows, and for typical function, for a search of all numbers up to a limit, the limit is ~104.
It may be observed that given the sequence x,y by substitution we obtain the sequences (x),y and x,(y), and from both of these sequence we obtain the sequence (x),(y). Similar x,y,z leads us to x·y·z via x·y,z and x,y·z. That is the candidate lists, related by substitution operations, form a directed acyclic graph, rather than a tree. We can modify the algorithm to use a hash function to determine when we are revisting a node, and terminate the recursion when we find we are revisting a node.
This modification extends the practical limit of the algorithm to ~105.
It may also be observed that any solution may be expressed as fn(g(l,r)), where l is derived from the left end of the sequences of digits, and r from the right end of the sequence of digits. Furthermore, when investigation numbers in sequence, the same left end sequences are repeatedly evaluated. From these observation is can be seen the caching the results for left end sequences would improve performance.
This is exploited by calculating for each possible pair l,r for a number (a n-digit number has n-1 possible such pairs) a list of expression values less that the target number, and then evaluating fn(g(l,r)) for each pair of values. Let li be the list of values for the i leftmost digits of the target number. This need only be recalculated (with appropriate increase in the maximum expression value excepted to cover all possible target numbers, of that length, with the same i leftmost digits) when these digits change.
This revision of the algorithm extends the practical range to ~106.