I wrote an Apache Spark program in Scala to compute TF-IDF over a corpus. It hangs at a point near the group by statement, and I need someone who can fix that issue.
I have a list of articles stored in S3 as Parquet. First I read it as a DataFrame, create n-grams from it, and keep that on one side.
On the other side, S3 has 10k posts (the corpus) as Parquet. I read that as a DataFrame as well and keep it.
Now I want to find the document frequency of each term (n-gram) against the corpus.
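To make the goal concrete, here is a minimal sketch of one common way to compute document frequency with DataFrames. It is not the code from my program. The column names (`term`, `doc_id`, `ngrams`) and the object name `DocumentFrequencySketch` are assumptions for illustration, and it assumes the list of terms is small enough to broadcast.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, count, explode}

object DocumentFrequencySketch {
  // terms:  one row per n-gram, column "term" (assumed name)
  // corpus: one row per post, columns "doc_id" and "ngrams" (array<string>, assumed names)
  def documentFrequency(terms: DataFrame, corpus: DataFrame): DataFrame = {
    // One row per (post, n-gram) pair; distinct() so a post counts once per term.
    val postings = corpus
      .select(col("doc_id"), explode(col("ngrams")).as("term"))
      .distinct()

    // Keep only the terms of interest, then count posts per term.
    // The inner join drops terms that never occur in the corpus.
    postings
      .join(broadcast(terms.select("term").distinct()), Seq("term"))
      .groupBy("term")
      .agg(count(col("doc_id")).as("df"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up data standing in for the Parquet inputs.
    val terms = Seq("spark sql", "group by").toDF("term")
    val corpus = Seq(
      (1L, Seq("spark sql", "group by")),
      (2L, Seq("spark sql", "spark sql")),
      (3L, Seq("word2vec model"))
    ).toDF("doc_id", "ngrams")

    // "spark sql" occurs in 2 posts, "group by" in 1.
    documentFrequency(terms, corpus).show()
    spark.stop()
  }
}
```

The join here is an equi-join on exploded n-grams, which avoids comparing every term against every post.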
I'm willing to share the code I have written with the right candidate.
Word2Vec knowledge is a plus.