Calculate features by averaging across training dataset?

2017-12-10 22:07:02

We have a large number of examples (50-nucleotide-long DNA sequences) for one of our machine learning tasks. Every example has an outcome that is originally continuous (CON), in the range 0-1. Since about 98% of the examples are around 1 and about 2% are around 0, with very few examples in between (which can be discarded), we have split the labels into two classes (0 vs 1, CLA), making it a classification rather than a regression problem.
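To make the labelling step concrete, here is a minimal sketch of how CON could be turned into CLA. The cut-offs `low=0.1` and `high=0.9` and the function name `con_to_class` are my own placeholders for illustration; we have not fixed exact thresholds for "around 0" and "around 1".

```python
def con_to_class(con, low=0.1, high=0.9):
    """Return 0 or 1 for clear-cut CON values, None for in-between ones.

    `low` and `high` are hypothetical cut-offs, not values from our data.
    """
    if con <= low:
        return 0
    if con >= high:
        return 1
    return None  # ambiguous example, to be discarded


# Toy CON values, made up for the example.
toy_con = [0.98, 0.02, 0.5, 1.0]
labels = [con_to_class(c) for c in toy_con]

# Keep only the examples that received a class label.
kept = [(c, y) for c, y in zip(toy_con, labels) if y is not None]
```

The in-between examples get `None` and are dropped, so only the clear-cut 0 and 1 cases remain for classification.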

However, when calculating features, we have an ongoing debate that bothers me. One of the collaborators would like to do the following to calculate some of the features:

For every 10-nucleotide sequence (10mer) that can be found in the dataset:

get every example with that 10mer in the 1st position

average the CON values of those examples and assign the average to that 10mer in the 1st position

Repeat for the 2nd, 3rd, etc. positions and for every available 10mer.
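As I understand the proposal, it amounts to something like the sketch below. This is only my reading of it, not my collaborator's actual code: the function name `positional_kmer_means` and the toy sequences are made up, positions are 0-based here, and the toy sequences are 12 nt instead of 50 nt to keep the example short.

```python
from collections import defaultdict

K = 10  # k-mer length from the description above


def positional_kmer_means(sequences, con_values, k=K):
    """Map (position, k-mer) -> mean CON over all examples that have
    that k-mer starting at that position (positions are 0-based)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, con in zip(sequences, con_values):
        # Slide a window of length k over the sequence.
        for pos in range(len(seq) - k + 1):
            key = (pos, seq[pos:pos + k])
            sums[key] += con
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data, invented for illustration only.
toy_seqs = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTACGTACGT"]
toy_con = [1.0, 0.0, 1.0]
means = positional_kmer_means(toy_seqs, toy_con)
```

Here `means[(0, "ACGTACGTAC")]` is the average CON of the first two toy sequences, because both start with that 10mer. Note that every average is built from the outcomes of the same examples it will later describe.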

Now we have a large dataset that has averages for certain 10mers in position 1, 2, etc. If a 10mer ca