Exploring Multimodal Features and Fusion for Time-Continuous Prediction of Emotional Valence and Arousal