We are looking to develop an Apache Spark (or equivalent) pipeline to replace our hosted ML platform, and to host the server ourselves. The current hosted platform performs well, with ROC AUC, precision, accuracy, recall, and F1 all above 0.98 (mostly above 0.99), but at 300,000+ production API transactions it is becoming quite expensive, so we are looking for alternatives.
Our current three classification models (in an ensemble configuration) use roughly 300 features and seem to perform best as decision tree models, but we would like to explore other model types to see which performs best.
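To compare candidate models against the figures quoted above, the same metrics would need to be computed on a held-out set. The following is only a minimal sketch of that calculation, assuming binary labels encoded as 0 and 1 (our actual label scheme may differ); the function name `binary_metrics` is hypothetical, and ROC AUC is left out because it requires predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for 0/1 labels (assumed encoding)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Confusion-matrix counts, treating 1 as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Fall back to 0.0 when a denominator is zero (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Running this once per candidate model on the same held-out samples would give a like-for-like comparison with the current ensemble.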
We have a training data set in an MSSQL database that we can give you access to. We will also need to develop a streaming API so that we can submit samples for analysis in real time and then upload the results (along with the extracted features) to a different table in the MSSQL database. Currently, our desktop software first checks the database to see whether an identical sample has already been submitted and analyzed; if it has not, the software submits the features to the hosted ML platform for analysis, then uploads the results and extracted features to the database.
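The check-then-submit flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual implementation: it uses Python's built-in `sqlite3` as a stand-in for MSSQL, identifies an "identical sample" by a SHA-256 hash of its features (how identity is really determined may differ), and the table name `results`, its columns, and the functions `feature_fingerprint`, `create_results_table` and `classify_sample` are all hypothetical. The `score_fn` argument stands in for whichever model service replaces the hosted platform.

```python
import hashlib
import json
import sqlite3
from typing import Callable, Dict, Tuple


def feature_fingerprint(features: Dict[str, float]) -> str:
    """Hash the features in a key-order-independent way (assumed identity rule)."""
    canonical = json.dumps(features, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def create_results_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical results table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "fingerprint TEXT PRIMARY KEY, "
        "features_json TEXT NOT NULL, "
        "label TEXT NOT NULL)"
    )
    conn.commit()


def classify_sample(
    conn: sqlite3.Connection,
    features: Dict[str, float],
    score_fn: Callable[[Dict[str, float]], str],
) -> Tuple[str, bool]:
    """Return (label, was_cached); only call the model for unseen samples."""
    fingerprint = feature_fingerprint(features)

    # Step 1: check whether an identical sample has already been analyzed.
    row = conn.execute(
        "SELECT label FROM results WHERE fingerprint = ?", (fingerprint,)
    ).fetchone()
    if row is not None:
        return row[0], True

    # Step 2: not seen before, so submit the features to the model.
    label = score_fn(features)

    # Step 3: store the result together with the extracted features.
    conn.execute(
        "INSERT INTO results (fingerprint, features_json, label) VALUES (?, ?, ?)",
        (fingerprint, json.dumps(features, sort_keys=True), label),
    )
    conn.commit()
    return label, False
```

Because repeat samples are answered from the database, only previously unseen samples reach the model, which is where the per-transaction cost arises today.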