ARTICLE INFO

Article Type

Original Research

Authors

Alizadeh S. (1)
Asghari M. (*)
Hosseini M.K. (1)






(*) Information Technology Department, Computer Engineering Faculty, K. N. Toosi University of Technology, Tehran, Iran
(1) Information Technology Department, Computer Engineering Faculty, K. N. Toosi University of Technology, Tehran, Iran

Correspondence


Article History

Received: February 26, 2016
Accepted: June 21, 2016
ePublished: August 15, 2017

BRIEF TEXT


Data mining is a fast-growing interdisciplinary field that combines domains such as databases, statistics, and machine learning to extract valuable information and knowledge from large amounts of data [1]. Infertility is one of the medical fields in which data mining techniques can be applied.

... [2-4]. Assisted reproductive technology (ART) is among the commonly used treatments and is effective for infertility of various causes. Commonly used assisted reproduction techniques include ICSI, GIFT, IVF, IUI, and ZIFT [4]. Many studies have applied data mining algorithms to infertility, each focusing on a specific factor. One study, which aimed to evaluate the success rate of IUI treatment, examined factors such as the duration of infertility, the couples' characteristics, and sperm test status, and concluded that IUI gives the best treatment results for infertility due to ovarian cancer [5]. Another study examined various factors affecting fertility in order to determine the success rate of IVF treatment. Its results showed that ovarian response to ovulation stimulation and the number of transferred embryos are important and effective factors in predicting IVF outcome [6]. A study of the number of embryos transferred in ART found that transferring 2 or 3 embryos instead of 1 significantly increases the probability of pregnancy, whereas transferring more embryos gives no significant further increase and raises the risk of multiple birth [7].

The aim of this study was to analyze the factors affecting the results of intrauterine insemination (IUI) using clustering.

This research is a data mining study.





This study followed the well-known Cross-Industry Standard Process for Data Mining (CRISP-DM; Figure 1). The steps followed in this study were:
Business understanding: In this step of the analysis, the processes in the hospital were examined. This is one of the most important stages of the research, and it was revisited repeatedly during the project to better understand the hospital's system and to evaluate the data mining results against it.
Data understanding: Patients are referred to Sarem Hospital for various treatments, including ICSI, IUI, GIFT, ZIFT, and IVF. Couples' personal information and test results are recorded in their files. In this phase, with consultation and assistance from the hospital team, all of the data available at Sarem Hospital were identified. This covered all data records and data attributes, as well as familiarity with the concepts behind the various treatments (including the types of tests).
Data preparation: An important phase of this study was cleaning and preparing the data for the data mining process, known as the preprocessing stage. At this stage, data were selected to keep the research tractable and the results purposeful. Records, which differed by medical field and by period, were collected from the hospital files, and examination of this body of data began. To limit the analysis, a single type of infertility treatment was considered: data were collected from patients who underwent IUI, and the clustering algorithm was run on these data and the results were analyzed.
Also, among the fields in the database, some of the more influential parameters were selected and used (Table 1); these were identified through repeated experiments, previous research in this field, and consultation with experienced people.
After the hospital system had been understood and the treatment processes had become familiar, the later stages covered understanding the data, selecting the required data from the large volume available, and cleaning and preprocessing them. At this point, the data were reviewed, categorized, deleted, or modified. Data cleaning yielded 400 cases from patients treated at Sarem Hospital between 1997 and 2009.
In the next step, the selected clustering model was run on the prepared sample to extract the knowledge and patterns in these data.
Modeling: An unsupervised (descriptive) algorithm was used to model the data. Clustering, one of the most commonly used descriptive approaches, was used to analyze the infertility data in this study. Descriptive algorithms are used to discover similar patterns across data. The basic task of the clustering algorithm used here, called "K-means", is to partition the data into K groups so that the data within each group are similar and the data in different groups are dissimilar [9]. This unsupervised algorithm gives a general analysis of the data and a partition of them. Using a clustering algorithm requires choosing a validity index. In this research, the Davies-Bouldin index was used; lower values of this index indicate a better clustering, so the number of clusters with the lowest value is preferred.
To achieve better analysis and interpretation of the results, several clustering algorithms were run on the data with different indices. This approach, in addition to guiding the choice of the best clustering index, helped to correct features and data. The clustering algorithms were run multiple times with different indices, and the Dunn and Davies-Bouldin indices were finally found to be the better indicators for these data.
Dunn index: This is one of the most common indices for assessing the number of clusters. Depending on how the distance between two clusters and the size of a cluster are calculated, different indices of the Dunn family are obtained.
Davies-Bouldin index: This index is a function of the ratio of intra-cluster dispersion to the separation between clusters [10].
Choosing the optimum number of clusters: The equations for the two indices above were implemented in the corresponding software, and clustering with up to 7 clusters was performed on the IUI data, with the following results. For the Dunn and Davies-Bouldin results, if circles are drawn around the cluster centers, the clusters appear to intersect (Chart 1 and Figure 2). The Dunn index therefore did not give a good result for this clustering because, as mentioned, it is based on the distance between clusters and the cluster diameter, with higher values being better. In comparison, the Davies-Bouldin index considers the ratio of dispersion to distance rather than distance alone, which makes it less sensitive to apparent overlap between clusters. The clusters appear to intersect, although they do not (Figure 2).
The Dunn index identified 2 clusters as the best, whereas the Davies-Bouldin index identified 6 clusters as optimal; in light of the discussion above, 6 was taken as the optimum number of clusters.
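The index-based choice of the number of clusters described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the software used in the study: the function names, the toy points, and the deterministic farthest-point seeding of K-means are all hypothetical choices made so that the example is reproducible. The Davies-Bouldin index is computed in its common form (mean, over clusters, of the worst ratio of summed scatter to centroid distance) and the Dunn index in its basic form (smallest between-cluster distance divided by largest cluster diameter).

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch, not the study's method)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's K-means; returns the non-empty clusters as lists of points."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    clusters = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
    return [c for c in clusters if c]


def davies_bouldin(clusters):
    """Mean of the worst (scatter_i + scatter_j) / centroid distance; lower is better."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ) / k


def dunn(clusters):
    """Smallest between-cluster distance over largest cluster diameter; higher is better."""
    separation = min(
        dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    diameter = max(dist(p, q) for c in clusters for p in c for q in c)
    return separation / diameter


def score_cluster_counts(points, k_max=7):
    """Return {k: (davies_bouldin, dunn)} for k = 2..k_max."""
    scores = {}
    for k in range(2, k_max + 1):
        clusters = kmeans(points, k)
        scores[k] = (davies_bouldin(clusters), dunn(clusters))
    return scores


# Toy data (hypothetical, not the study's records): three tight, well-separated groups.
toy = [(x + dx, y + dy) for x, y in [(0, 0), (10, 0), (0, 10)]
       for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
scores = score_cluster_counts(toy)
best_by_db = min(scores, key=lambda k: scores[k][0])
best_by_dunn = max(scores, key=lambda k: scores[k][1])
print(best_by_db, best_by_dunn)
```

On such clean toy data the two indices agree. On real data they can disagree, as reported here, because the Dunn index depends only on the single smallest separation and the single largest diameter, while the Davies-Bouldin index averages a scatter-to-distance ratio over all clusters.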
Because K-means can become trapped in local optima, another algorithm, EM (Expectation Maximization) [11], was used to check the result obtained with K-means and the Davies-Bouldin index. The EM algorithm also yielded 6 clusters. This algorithm assigns each sample a probability distribution that gives its probability of belonging to each cluster. It can also decide on an appropriate number of clusters by means of the internal cross-validation procedure that it implements [11]. The algorithm was run in the WEKA program. As stated, this cross-validation procedure determines the appropriate number of clusters.
The two algorithms gave the same result by two different approaches. As a final check on the number of clusters, the results of the two algorithms were reviewed with experts at Sarem Hospital to confirm that the algorithms had been applied correctly (Figure 3).
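The idea of letting EM choose the number of clusters by cross-validation can be illustrated with a small sketch. WEKA's EM clusterer is documented as starting from one cluster and adding clusters for as long as the cross-validated log-likelihood keeps increasing; the code below follows that general idea only. It is not WEKA's implementation: the one-dimensional Gaussian mixture, the quantile-based starting values, the fold assignment by index, the variance floor, and the toy values are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, model):
    """Density of x under a mixture given as (weights, means, variances)."""
    weights, means, variances = model
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def fit_em(data, k, n_iter=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture with k components by EM."""
    s = sorted(data)
    n = len(s)
    # Deterministic start: means at evenly spaced quantiles, one shared variance.
    means = [s[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(s) / n
    shared_var = max(sum((x - grand_mean) ** 2 for x in s) / n, min_var)
    variances = [shared_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability of each component for each sample.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            if nj > 1e-12:
                means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
                variances[j] = max(var, min_var)
    return weights, means, variances


def cv_log_likelihood(data, k, folds=5):
    """Mean held-out log-likelihood per sample, with folds assigned by index."""
    total = 0.0
    for f in range(folds):
        train = [x for i, x in enumerate(data) if i % folds != f]
        held_out = [x for i, x in enumerate(data) if i % folds == f]
        model = fit_em(train, k)
        total += sum(math.log(max(mixture_density(x, model), 1e-300)) for x in held_out)
    return total / len(data)


def choose_n_clusters(data, k_max=7, folds=5):
    """Add clusters while the cross-validated log-likelihood keeps increasing."""
    best_k, best_ll = 1, cv_log_likelihood(data, 1, folds)
    for k in range(2, k_max + 1):
        ll = cv_log_likelihood(data, k, folds)
        if ll <= best_ll:
            break
        best_k, best_ll = k, ll
    return best_k


# Toy values (hypothetical, not the study's data): two groups near 1 and near 9.
toy = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.05, 0.95, 1.0,
       9.0, 9.2, 8.8, 9.1, 8.9, 9.3, 8.7, 9.05, 8.95, 9.0]
print(choose_n_clusters(toy))
```

Because each sample receives a probability for every component rather than a single hard label, the held-out log-likelihood gives a criterion for the number of clusters that is independent of the geometric indices used with K-means, which is why agreement between the two approaches is informative.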

After the K-means algorithm, which was selected for clustering the data, had been run 7 times and the Davies-Bouldin index had been used to choose the optimum number of clusters, 6 clusters were selected and the distribution of patients across the clusters was determined (Table 2). Each of the six selected clusters is analyzed below.
The results for cluster 1 are shown in Table 3. For six samples in this cluster, success or failure was not recorded in the database (Table 4).
The results for the second cluster are shown in Table 5, and the success and failure results for this cluster are given in Table 6.
The results for the third cluster are given in Table 7, and the success and failure results are presented in Table 8. According to Table 8, success or failure was not recorded in the database for 1 sample in this cluster.
The results for the fourth cluster are given in Table 9, and the success and failure results for this cluster are listed in Table 10.
The results for the fifth cluster are given in Table 11, and the success and failure results for this cluster are given in Table 12.
The results for the sixth cluster are given in Table 13. According to Table 13, success or failure was not recorded in the database for 5 samples in this cluster. The success and failure results for this cluster are given in Table 14. According to Table 14, success or failure was not recorded in the database for 6 samples in this cluster.
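The per-cluster success and failure tables, including the samples whose outcome was not recorded, amount to a simple tally once each patient has a cluster label. The sketch below shows one way to build such a summary; the function name, the outcome labels, and the six example records are hypothetical and are not taken from the study's tables.

```python
from collections import Counter, defaultdict


def outcome_by_cluster(labels, outcomes):
    """Count success, failure and unrecorded outcomes for each cluster label."""
    table = defaultdict(Counter)
    for label, outcome in zip(labels, outcomes):
        table[label][outcome if outcome is not None else "unrecorded"] += 1
    summary = {}
    for label, counts in sorted(table.items()):
        recorded = counts["success"] + counts["failure"]
        summary[label] = {
            "n": sum(counts.values()),
            "success": counts["success"],
            "failure": counts["failure"],
            "unrecorded": counts["unrecorded"],
            # The rate uses recorded outcomes only, so missing records do not count as failures.
            "success_rate": counts["success"] / recorded if recorded else None,
        }
    return summary


# Hypothetical example: cluster label and recorded outcome (None = not recorded).
labels = [1, 1, 1, 2, 2, 2]
outcomes = ["success", "failure", None, "success", "success", None]
for cluster, row in outcome_by_cluster(labels, outcomes).items():
    print(cluster, row)
```

Keeping unrecorded outcomes as a separate column, rather than dropping those patients, preserves the cluster sizes while making it clear which denominators the success rates are based on.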

In a study similar to the present one, but on IVF, the success rate of IVF treatment and various factors affecting fertility were evaluated. The findings indicated that ovarian response to ovulation stimulation and the number of transferred embryos are important and effective factors in predicting IVF outcomes [5]. In 2008, another study examined the probability of pregnancy in relation to the number of embryos transferred. By examining the number of embryos transferred in ART, it determined that transferring 2 or 3 embryos instead of 1 significantly increases the probability of pregnancy, whereas with more embryos the further increase in probability is not significant and is not worth the risk of multiple birth. In that study, the probability of multiple birth was investigated using Bayesian network techniques [7]. Research has also been done in other areas, such as prediction of IVF treatment outcome; for example, a 2010 study selected 39 features through sampling, feature characterization, and data cleansing and, to enhance accuracy, combined a decision tree with a genetic algorithm [12]. Another system in the field of infertility is TA3IV, which uses case-based reasoning (CBR) to predict the appropriate type of IVF treatment. The goal of that work is to use stored experience, through CBR, as a possible way to increase the IVF success rate. Given a system built on a good population of cases, it becomes feasible to exploit the knowledge contained in a large amount of data. That study describes the TA3IV system, which is built on a knowledge-based CBR system and can support physicians [13]. Another study examined 2450 couples and also provided a good basis for feature selection, but it reported only a statistical survey of the patients. The selected features included BMI, type of infertility, duration of infertility, age, etc. [14].
The main difference of the present research is that, although a field recording the success or failure of the IUI procedure was available, it was first removed from the features and only the patients' characteristics were evaluated. The main aim was to obtain general knowledge about the hospital's patients; after the data had been analyzed, the success or failure of each patient could be evaluated on the basis of their cluster, and this can serve as a basis for creating a decision-making system in the future.





Factors such as age, body mass, type of infertility, cause of infertility, etc. can determine the success rate of the IUI method.

The authors thank Dr. Saremi, chairman of the hospital, Dr. Salehian, and the capable and highly qualified staff of the Sarem Medical Women's Specialized Hospital, who helped at various stages of this study, and especially Ms. Ghafari and Ms. Shami for their work.







TABLES and CHARTS

See the attached file.


CITATION LINKS

[1] Chakrabarti S, Ester M, Fayyad U, Gehrke J, Han J, Morishita S, et al. Data Mining Curriculum: A Proposal (Version 1.0) [Internet]. London: The community for data mining, data science and analytics, SIGKDD; 1999. [updated 2006 Apr 30; cited 2007 Dec 14]. Available from: www.kdd.org/exploration_files/CURMay06.pdf.
[2] Berry GT, Baker L, Kaplan FS, Witzleben CL. Diabetes-like renal glomerular disease in Fanconi-Bickel syndrome. Pediatr Nephrol. 1995;9(3):287-91.
[3] Han J, Kamber M. Data Mining: Concepts and Techniques. 2nd ed. Burlington: Morgan Kaufmann; 2011.
[4] Saremi AT. Infertility Guideline. Tehran: Sarem Research Center; 2009. [Persian]
[5] Vahid Roudsari F, Ayati S, Mirzaeeyan S, Shakeri MT, Akhtardel H. Fertility outcome after IVF and related factors. J Gorgan Univ Med Sci. 2009;11(3):42-6. [Persian]
[6] Morales DA, Bengoetxea E, Larranaga P, Garcia M, Franco Y, Fresnada M, et al. Bayesian classification for the selection of in vitro human embryos using morphological and clinical data. Comput Methods Programs Biomed. 2008;90(2):104-16.
[7] Sohrabvand F, Shariat M, Fotoohi Ghiam N, Hashemi M. The relationship between number of transferred embryos and pregnancy rate in ART cycles. Tehran Univ Med J. 2009;67(2):132-6. [Persian]
[8] Gazanfari M, Alizadeh S, Teimourpur B. Data Mining and Knowledge Discovery. Tehran: Iran University of Science and Industry Press; 2011.
[9] Han L, Zhong Y, Huang B, Han L, Pan L, Xu X, et al. Sodium butyrate activates erythroid-specific 5-aminolevulinate synthase gene through Sp1 elements at its promoter. Blood Cells Mol Dis. 2008;41(2):148-53.
[10] Gonzalez T, Marggie D. A comparison in cluster validation techniques [Dissertation]. Puerto Rico: University of Puerto Rico; 2006.
[11] Witten IH, Frank E, Hall M. Data Mining: Practical Machine Learning Tools and Techniques. Burlington: Morgan Kaufmann; 2011.
[12] Guh RS, Wu TCJ, Weng SP. Integrating genetic algorithm and decision tree learning for assistance in predicting in vitro fertilization outcomes. Expert Syst Appl. 2011;38(4):4437-49.
[13] Jurisica I, Mylopoulos J, Glasgow J, Shapiro H, Casper RF. Case-based reasoning in IVF: Prediction and knowledge mining. Artif Intell Med. 1998;12(1):1-24.
[14] Cai Qf, Wan F, Huang R, Zhang HW. Factors predicting the cumulative outcome of IVF/ICSI treatment: A multivariable analysis of 2450 patients. Hum Reprod. 2011;26(9):2532-40.