With the current imdb.load_data(), the following results are observed
for different values of maxlen.
load_data (len(x_train), len(x_test))
------------------------------------------------------------
imdb.load_data(maxlen=50) --> (1035, 0)
imdb.load_data(maxlen=100) --> (5736, 0)
imdb.load_data(maxlen=200) --> (25000, 3913)
imdb.load_data() --> (25000, 25000)
Analysis: We can observe that when maxlen is low, the number of test
samples can be 0. This is because the train and test data are
concatenated first, then the samples with length > maxlen are removed,
and the first 25,000 of the remaining samples are treated as training
data, leaving few or no samples for the test split.
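A minimal sketch of the problematic behavior, on toy data (simplified; the real Keras implementation also handles labels, seeding, and index offsets, and load_data_current is a hypothetical name for illustration):

```python
def load_data_current(x_train, x_test, maxlen=None):
    # Sketch of the current behavior: concatenate the splits first,
    # filter by length second, and split at the ORIGINAL train size.
    n_train = len(x_train)
    xs = list(x_train) + list(x_test)
    if maxlen is not None:
        # remove samples with length > maxlen, as described above
        xs = [x for x in xs if len(x) <= maxlen]
    # short test samples that survive filtering are absorbed into
    # the training split, so the test split can end up empty
    return xs[:n_train], xs[n_train:]

# toy data: four "train" and four "test" sequences of varying length
train = [[1] * n for n in (5, 30, 40, 50)]
test = [[1] * n for n in (6, 35, 45, 55)]

tr, te = load_data_current(train, test, maxlen=10)
# only the two short sequences survive, and both land in the train
# split because the split point is still the original train size (4):
# len(tr) == 2, len(te) == 0
```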
Fix: Filter each split first to remove the samples with
length > maxlen, and only then concatenate for further processing.
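A sketch of the proposed fix under the same simplifying assumptions (load_data_fixed is a hypothetical name; the real change would go inside Keras's loader):

```python
def load_data_fixed(x_train, x_test, maxlen=None):
    # Filter each split independently BEFORE concatenating, so the
    # train/test boundary reflects the filtered sizes.
    if maxlen is not None:
        x_train = [x for x in x_train if len(x) <= maxlen]
        x_test = [x for x in x_test if len(x) <= maxlen]
    n_train = len(x_train)
    xs = list(x_train) + list(x_test)
    return xs[:n_train], xs[n_train:]

# same toy data as before
train = [[1] * n for n in (5, 30, 40, 50)]
test = [[1] * n for n in (6, 35, 45, 55)]

tr, te = load_data_fixed(train, test, maxlen=10)
# each split now keeps its own surviving sample:
# len(tr) == 1, len(te) == 1
```

With this ordering, a small maxlen shrinks both splits proportionally instead of emptying the test set, which matches the fixed results below.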
The following are the results after the fix.
fixed load_data (len(x_train), len(x_test))
------------------------------------------------------------
imdb.load_data(maxlen=50) --> (477, 558)
imdb.load_data(maxlen=100) --> (2773, 2963)
imdb.load_data(maxlen=200) --> (14244, 14669)
imdb.load_data() --> (25000, 25000)