Yes I know you're probably sick of this by now, but I feel that if we are developing layouts for English, then the input texts should be as close to generalised English as possible.
Was not happy with this, and found out why when I tried whittling the list down to 10 or 15. Did it in stages and noticed the order jumping around too much, so I modified the program to scale each of the four test results to the range 0 - 1 and then average the four scaled results for each input text; the highest average wins. Also pushed the "length" in Jaro-Winkler to 28, effectively matching against
" etaoinsrhldcumpfg.ywb-,v0k1" (sans quotes).
Results attached. Putins-resistance is the "part of the resistance" text pasted onto the bottom of Putin's speech text. The combo does better than either one individually.
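For anyone who wants to reproduce the ranking step, here is a minimal sketch in Python. It is not the actual program. It assumes min-max scaling per test, assumes a higher score is better on all four tests, and assumes the comparison string is simply the most frequent characters of the text, lower-cased and with line breaks ignored. The names rank_texts and frequency_order are made up for this example.

```python
from collections import Counter


def rank_texts(scores):
    """Rank input texts by the average of their min-max scaled test results.

    scores maps a text name to a list of raw results, one per test.
    Assumption: a higher raw result is better on every test.
    """
    names = list(scores)
    n_tests = len(next(iter(scores.values())))
    scaled = {name: [] for name in names}
    for i in range(n_tests):
        column = [scores[name][i] for name in names]
        low, high = min(column), max(column)
        span = high - low
        for name in names:
            # A test on which every text scored the same cannot separate them.
            value = (scores[name][i] - low) / span if span else 0.0
            scaled[name].append(value)
    averages = {name: sum(vals) / len(vals) for name, vals in scaled.items()}
    # Highest average first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def frequency_order(text, length=28):
    """Return the `length` most frequent characters, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch not in "\r\n")
    return "".join(ch for ch, _ in counts.most_common(length))
```

The string returned by frequency_order for each candidate text is what would then be compared against the 28-character reference string with whatever Jaro-Winkler implementation is in use.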
Now been scratching my head about a corpus for code, and getting nowhere.
Current code tests are various sample tasks in an assortment of languages, borrowed from RosettaCode, as well as Google home page (not real world typing), plus Keyboard Layout Editor single-page app.
I think we should be using samples from the most-used languages. Problem is determining which those are. Our local tech website regularly publishes a pair of lists of "top" languages, but we're all sceptical of those lists because of how they are generated. Current lists here:
https://mybroadband.co.za/news/software/274403-python-climbing-most-popular-programming-languages-list.html and screengrab attached.
StackOverflow did their own survey, results here:
https://insights.stackoverflow.com/survey/2018/ (scroll down to Most Popular Technologies), which gives what is probably a more real-world list.
The part that worries me about these is the absence of things like COBOL and Fortran and possibly even Ada ... and I suspect they are missing because programmers in these languages don't hang out on StackOverflow or need to run Google searches on how to sort an array ... they already know what they're doing.
So I think I will go with the top end of the StackOverflow list, down to "C". TypeScript is very similar to JavaScript, which is first on the list.
Started poking around for samples on GitHub, then the "James Bond" problem reared its head again.
1. pretty comment separators eg /* ------------------------------------------------------------------------------*/
which are going to mess up the character distribution analysis.
2. for example, in CSS, repeated use of various phrases, eg
.hvr-pulse-shrink:hover, .hvr-pulse-shrink:focus, .hvr-pulse-shrink:active {
  -webkit-animation-name: hvr-pulse-shrink;
  animation-name: hvr-pulse-shrink;
  -webkit-animation-duration: 0.3s;
  animation-duration: 0.3s;
  -webkit-animation-timing-function: linear;
  animation-timing-function: linear;
  -webkit-animation-iteration-count: infinite;
  animation-iteration-count: infinite;
  -webkit-animation-direction: alternate;
  animation-direction: alternate;
}
which introduces a surplus of w, b and k, which are less-common letters normally, as well as "-".
or repeated use of variable or class names, eg
.token.property,
.token.tag,
.token.boolean,
.token.number,
.token.constant,
.token.symbol,
.token.deleted {
  color: #905;
}
or
def parse_args():
    parser = argparse.ArgumentParser(description='Run Electron tests')
    parser.add_argument('--use_instrumented_asar',
                        help='Run tests with coverage instructed asar file',
                        action='store_true',
                        required=False)
    parser.add_argument('--rebuild_native_modules',
                        help='Rebuild native modules used by specs',
                        action='store_true',
                        required=False)
    parser.add_argument('--ci',
                        help='Run tests in CI mode',
                        action='store_true',
                        required=False)
    parser.add_argument('-g', '--grep',
                        help='Only run tests matching <pattern>',
                        metavar='pattern',
                        required=False)
    parser.add_argument('-i', '--invert',
                        help='Inverts --grep matches',
                        action='store_true',
                        required=False)
    parser.add_argument('-v', '--verbose',
                        action='store_true',
                        help='Prints the output of the subprocesses')
    parser.add_argument('-c', '--configuration',
                        help='Build configuration to run tests against',
                        default='D',
                        required=False)
    return parser.parse_args()
So the various samples chosen are going to skew the character distribution in different directions. The only way around that I can see is to have a large number of source projects, take only small bits from each, and hope that it all balances out in the end.
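A rough sketch of what that could look like in Python, purely as a starting point. It assumes the source files have already been read into a dict of project name to text. The 4-character threshold for "decorative" runs and the 20-line snippet size are arbitrary guesses, and all the function names are made up for this example. It collapses the pretty separator runs from point 1, takes one small random slice per project, and reports each character's share of the result.

```python
import random
import re
from collections import Counter

# A run of 4 or more identical punctuation characters is treated as decoration.
# The threshold is an assumption; it leaves "===", "..." and "--" alone.
DECORATION = re.compile(r"([^\w\s])\1{3,}")


def strip_decoration(source):
    """Collapse decorative runs such as '----------' to a single character."""
    return DECORATION.sub(r"\1", source)


def sample_snippet(text, max_lines, rng):
    """Return at most max_lines consecutive lines from a random position."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    start = rng.randrange(len(lines) - max_lines + 1)
    return "\n".join(lines[start:start + max_lines])


def build_corpus(sources, max_lines=20, seed=0):
    """Join one small cleaned snippet from each project into a single corpus.

    sources maps a project name to its source text. A fixed seed keeps the
    selection repeatable between runs.
    """
    rng = random.Random(seed)
    snippets = [
        sample_snippet(strip_decoration(text), max_lines, rng)
        for text in sources.values()
    ]
    return "\n".join(snippets)


def char_share(corpus):
    """Return each character's share of the corpus as a fraction of 1."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}
```

Running build_corpus with a few different seeds and comparing the char_share results would give some idea of whether the distribution has settled down or whether more projects are needed.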
Am open to better ideas at this point ....
Thanks, Ian