Calculate features by averaging across training dataset?
We have a large number of examples (50-nucleotide-long DNA sequences) for one of our machine learning tasks. Every example has an outcome that is originally continuous (CON), in the range 0-1. About 98% of the examples are around 1 and about 2% are around 0, with very few examples in between (those can be discarded), so we have split the labels into two classes (0 vs 1, CLA), making it a classification rather than a regression problem.
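To make the label split concrete, here is a minimal sketch of how CON could be turned into CLA. The cut-offs `low` and `high` and the function name `binarize_con` are hypothetical, chosen only for illustration; the actual thresholds we use are not stated here.

```python
def binarize_con(con_values, low=0.1, high=0.9):
    """Turn continuous CON values into class labels (CLA).

    `low` and `high` are hypothetical cut-offs, not the real ones.
    Values strictly between them get None, marking examples to discard.
    """
    labels = []
    for con in con_values:
        if con <= low:
            labels.append(0)
        elif con >= high:
            labels.append(1)
        else:
            labels.append(None)  # in-between example, to be dropped
    return labels
```

Examples labelled `None` would be removed before training.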
However, when calculating features, we have an ongoing debate that bothers me. One of the collaborators would like to do the following to calculate some of the features:
For every 10-nucleotide sequence (10mer) that can be found in the dataset:
get every example with that 10mer in the 1st position
average the CON values of those examples and assign the average to that 10mer in the 1st position
Repeat for the 2nd, 3rd, etc. positions and for every available 10mer.
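The steps above can be sketched roughly as follows. This is only my reading of the proposal, not the collaborator's actual code; the function name `positional_kmer_means` is made up, and positions are 0-based here, so the "1st position" is `pos == 0`.

```python
from collections import defaultdict


def positional_kmer_means(sequences, con_values, k=10):
    """Map (position, k-mer) to the mean CON of all examples
    that carry that k-mer at that position."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, con in zip(sequences, con_values):
        # Slide a window of length k over the sequence.
        for pos in range(len(seq) - k + 1):
            key = (pos, seq[pos:pos + k])
            sums[key] += con
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

For a toy call with `k=3`, `positional_kmer_means(["ACGTA", "ACGTT"], [1.0, 0.0], k=3)` gives 0.5 for `(0, "ACG")`, since both examples share that 3-mer at the first position. Note that the table is built from whichever examples are passed in, so which part of the dataset it sees is decided by the caller.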
Now we have a large dataset that has averages for certain 10mers in position 1, 2, etc. If a 10mer ca