Research Article

Data mining algorithm based on Renyi fuzzy association rule: an application for the selection of suitable course

Article: 2271902 | Received 09 May 2023, Accepted 12 Oct 2023, Published online: 14 Nov 2023

Abstract

Data mining is used to discover patterns in data. It is a knowledge discovery technique for analyzing hidden patterns and information in an available database. Because real data are often uncertain, fuzzy-based data mining techniques have been incorporated for decision-making; they model incomplete information and relations effectively in situations where precise reasoning is not available. A new fuzzy data mining algorithm is proposed that supports decision-making under uncertain conditions. The algorithm is demonstrated by means of a case study on the selection of a suitable career after studies. In the case study, data of 60 students studying in 4 different universities and pursuing 3 different courses are used. The decision regarding preparatory classes, namely further studies (FS), job-oriented (JO) or coaching (NO), is made using the proposed method. From the results, it is concluded that a student studying at university U2 and pursuing course C2 will take preparatory classes for a job-oriented course.

1 Introduction

Today is the age of information, in which huge amounts of data from various sources and technologies are collected. Data mining has become a necessary tool for analyzing large amounts of scattered data to discover patterns and knowledge as required. Many data mining algorithms and tools have been used to discover interesting and useful patterns from available datasets. In the real world, we deal not only with complex, multi-dimensional datasets but also with uncertain and incomplete ones. Because of the unreliability and unpredictability of real-time situations, data mining has become more popular in a fuzzy environment. Fuzzy sets model reasoning in a human-like way, which makes information understandable to humans, and they have natural applications in data mining. Methods such as classification rules, association rules and clustering are commonly used to extract valuable information from data. The association rule is one of the data mining techniques for tackling massive amounts of structured or unstructured data. It may be single-dimensional or multi-dimensional and has been combined with fuzzy data mining to handle the uncertainty present in real databases. Real-life datasets contain considerable uncertainty, and fuzzy mathematics is an efficient tool for dealing with it; fuzzy data mining has therefore become an effective technique for handling big data under uncertainty. The fuzzy association rule is a fuzzy data mining technique that provides interesting patterns and relations between datasets and attributes in a fuzzy environment. In 1994, Agrawal and Srikant discussed the algorithm for association rules known as the Apriori algorithm. Agrawal et al. (Citation1993) and Agrawal and Srikant (Citation1994) gave the association rule algorithm for market-basket analysis, which provides a rule stating that if item A occurs in a transaction, how likely it is that another item B will occur in the same transaction. Klemetinen et al. (Citation1994) gave an application of association rules in bread-butter analysis: every transaction that contains bread also contains butter. Several researchers have presented comparative analyses of fuzzy association rules and their application to student performance in academics. Oladipupo et al. (Citation2012) presented an application of fuzzy association rules to analyze the academic performance of students. Gupta and Mamtora (Citation2014) presented a market-basket analysis using association rules. Sharmila and Vijayarani (Citation2019) worked on a comparative study of association rule mining in a fuzzy environment. The Apriori algorithm is the standard approach for implementing association rules. In this paper, an algorithm based on fuzzy sets and Renyi's entropy has been developed to establish a new association rule with applications in decision-making. From the literature, it is known that fuzzy-based techniques have the potential to deal with uncertain situations, filter out unnecessary information and produce more effective association rules than those formed by the traditional Apriori algorithm.

In Section 2, a literature review covering data mining techniques with fuzzy sets and fuzzy entropy is presented. Basic concepts used in the paper are presented in Section 3. The data mining algorithm based on the fuzzy association rule is proposed in Section 4. A case study on the selection of a suitable career after studies is demonstrated in Section 5. In Section 6, the results of the case study are given.

2 Literature review

Many data mining tools help analysts discover knowledge by finding patterns in data. In practice, the available information contains vagueness and uncertainty, and to manage imprecision several theories have been used, such as fuzzy set theory (Zadeh Citation1965), rough set theory (Pawlak Citation1982) and set pair theory (Zhao Citation1989, Citation2000). Among these, fuzzy set theory is the most popular because its reasoning is similar to human reasoning and it translates linguistic information. Shannon (Citation1948) introduced a landmark theory by presenting the mathematical theory of communication as a statistical process, which works well under uncertain conditions. As noted by Frawley et al. (Citation1992) and Hand et al. (Citation2001), for the past several decades researchers have been engaged in knowledge discovery in databases to extract meaningful information from datasets, which gave birth to the discipline of data mining; data mining is the process of finding patterns in large relational databases. Real-world situations often involve imprecision and uncertainty. The knowledge represented by fuzzy sets is understandable by humans and is compact and robust. Zadeh (Citation1965, Citation1968) proposed fuzzy set theory to deal with uncertainty; it is considered an efficient tool for decision-making and has applications in almost every sphere. Klir and Yuan (Citation2015) and Zimmermann (Citation1991) spread the concept of fuzzy sets across domains. Luca and Termini (Citation1972) defined a non-probabilistic entropy based on fuzzy set theory and proposed axioms that a fuzzy entropy measure should satisfy. According to Buckles and Petry (Citation1982), Raju and Majumdar (Citation1988), Delgado and Gonzalez (Citation1993), Yuan and Shaw (Citation1995), Kuok et al. (Citation1998) and Hong et al. (Citation2000), fuzzy association rules can handle quantitative data and provide smoother transition boundaries between partitions, and therefore constitute a good solution for imprecise data. Jianjiang et al. (Citation2003) proposed the support and confidence of the fuzzy association rule. Researchers such as Agrawal et al. (Citation1993), Clark and Matwin (Citation1993), Fritzke (Citation1996), Fukuda et al. (Citation1996, Citation1999), Hong and Tseng (Citation1997), Lee and Kwang (Citation1997), Pazzani et al. (Citation1997), Yoda et al. (Citation1997), Fu et al. (Citation1998), Liu et al. (Citation1998), Han et al. (Citation2000), Wang et al. (Citation2000), Li et al. (Citation2001) and Baralis and Garza (Citation2002) applied learning algorithms based on decision trees and rules to various domains of research. Hajek et al. (Citation2010) presented the theoretical version of the GUHA method in the context of data mining. Gupta et al. (Citation2014) proposed a method based on Shannon's entropy in generalized fuzzy set theory for selecting a course for students who have completed 12th grade, by establishing a relationship between the courses and the students' interest in a particular course. Intan (Citation2006) observed that data mining association rules constitute a knowledge discovery tool that records all the possible rules for the attributes; the prime objective of association rules is to discover patterns in datasets, which can be quantitative as well as qualitative in nature. Dubois et al. (Citation2006) proposed the foundations of fuzzy association rule mining.
Yager (Citation1982) proposed the use of linguistic variables in connection with fuzzy association rules, which was further developed by Kacprzyk et al. (Citation2000). Mining of the linguistic version of IF-THEN rules with evaluation was proposed by Novak et al. (Citation2008). In this work, an association rule of data mining is proposed in a fuzzy environment that provides a correlation between the attributes.

3 Basic preliminaries

In this section, the basic concepts used in the paper are defined.

3.1 Fuzzy sets (FS)

Zadeh (Citation1965) introduced fuzzy sets as an extension of classical sets, whose elements have degrees of membership.

A fuzzy set $A$ is defined in a finite universe of discourse $X=\{x_1,x_2,\dots,x_n\}$ as $A=\{\langle x,\mu_A(x)\rangle : x\in X\}$, where $\mu_A:X\to[0,1]$ is the membership function of the set $A$ and $\mu_A(x)$ is called the grade of membership of $x\in X$ in $A$.

3.2 Renyi’s entropy

Renyi (Citation1960) proposed a measure of entropy, commonly known as Renyi's entropy, defined as
$$H_\alpha(P)=\frac{1}{1-\alpha}\log\left(\frac{\sum_{k=1}^{n}p_k^{\alpha}}{\sum_{k=1}^{n}p_k}\right),\qquad \alpha\neq 1,\ \alpha>0.$$
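As a quick illustration, the following Python sketch (not part of the original paper) evaluates the formula above for a given distribution; the normalisation by $\sum_k p_k$ follows the definition as written, so unnormalised distributions are also accepted.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha, following the formula above.

    `p` may be an incomplete (unnormalised) distribution; the denominator
    sum(p) in the formula accounts for that.
    """
    p = np.asarray(p, dtype=float)
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return np.log(np.sum(p ** alpha) / np.sum(p)) / (1.0 - alpha)

# Example: a uniform distribution over 4 outcomes gives log(4) for any alpha.
print(renyi_entropy([0.25, 0.25, 0.25, 0.25], alpha=2.0))  # ~1.386 = ln 4
```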

3.3 Fuzzy association rule

Every association rule follows the three steps given in Fig. 1. In many real-life applications the data patterns are fuzzy; otherwise, mapping crisp data to fuzzy data improves the evaluation of semantic information. The fuzzy association rule is defined as follows:

Fig. 1 Steps in association rule mining.


Let $X=\{x_1,x_2,\dots,x_m\}$ be the set of items in a fuzzy transaction set $\Omega=\{\omega_1,\omega_2,\dots,\omega_n\}$, and let $A,B\subset X$ be two non-empty disjoint crisp subsets, known as itemsets.

For each transaction $\omega_i\in\Omega$, the fuzzy association rule is defined as: $A\Rightarrow B$ holds in $\Omega$ iff $\tilde{\mu}(A)\le\tilde{\mu}(B)$ for every fuzzy transaction $\tilde{\mu}\in\Omega$, where $A,B\subset X$ and $A\cap B=\phi$.

In large databases, association rule learning is a rule-based technique for discovering meaningful relations between variables.

For example

Let $A=\{a_1,\dots,a_n\}$ be a set of attributes, known as items, and let $T=\{t_1,\dots,t_n\}$ be a set of transactions, known as the database. Each transaction in $T$ has a unique ID and contains a subset of the items in $A$. The rule is defined as $X\Rightarrow Y$, where $X$ and $Y$ are disjoint itemsets called the antecedent and consequent respectively; e.g., if students opt for an engineering course, they are also likely to opt for basic science courses.
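To make the fuzzy form of the rule concrete, here is a minimal Python sketch (not from the paper). It assumes each fuzzy transaction is a mapping from items to membership degrees and that the membership of an itemset is the minimum of its item memberships (a common t-norm choice, taken here as an assumption); the rule $A\Rightarrow B$ is then taken to hold when $\tilde{\mu}(A)\le\tilde{\mu}(B)$ in every transaction, as in the definition above.

```python
def itemset_membership(transaction, itemset):
    # Membership degree of an itemset in a fuzzy transaction,
    # taken here as the minimum of the item memberships (assumption).
    return min(transaction.get(item, 0.0) for item in itemset)

def rule_holds(transactions, A, B):
    # A => B holds in the transaction set iff mu(A) <= mu(B) in every transaction.
    return all(itemset_membership(t, A) <= itemset_membership(t, B)
               for t in transactions)

# Hypothetical fuzzy transactions over university/course/class attributes.
transactions = [
    {"U2": 0.9, "C2": 0.8, "JO": 0.85},
    {"U2": 0.7, "C2": 0.6, "JO": 0.75},
]
print(rule_holds(transactions, {"U2", "C2"}, {"JO"}))  # True for these values
```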

3.4 Apriori algorithm

To discover frequent itemsets from a dataset for association rules, the Apriori algorithm is used (Thiruvady Citation2003). This algorithm uses prior knowledge of frequent itemset properties to discover association rules and follows a two-step process:

Step 1: Find all the frequent itemsets to be considered for the association rules.

Step 2: Generate association rules from the frequent itemsets. From the seed frequent itemsets, candidate itemsets are constructed; in general, to construct the candidate $\lambda$-itemsets, the frequent $(\lambda-1)$-itemsets are required. Finally, these itemsets are tested against minimum support (min-sup) and minimum confidence (min-conf). It is an iterative process, which continues until no further frequent itemsets are found.

The Apriori algorithm is used to extract frequent patterns from the data in order to establish relationships between items. The frequency of an itemset is measured by the number of transactions in which it appears. In other words, an itemset is a collection of one or more items occurring together in a dataset; it may be a single item or a set of multiple items. The algorithm uses prior knowledge about frequent itemsets, which is why it is called Apriori, and relies on the property that if an itemset is frequent, then all of its subsets must also be frequent.

For example

If the itemset {I, II, III} appears frequently in a dataset, then its subsets {I, II}, {I, III}, {II, III}, {I}, {II} and {III} must also appear frequently in the dataset.
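The following short Python sketch (an illustration, not the paper's code) implements the classical crisp Apriori loop described above: candidate k-itemsets are generated from the frequent (k-1)-itemsets, candidates with an infrequent subset are pruned, and the survivors are tested against minimum support. The toy transactions at the end are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch for crisp transactions given as sets of items."""
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # Join step: candidate k-itemsets from frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property) followed by the minimum-support test
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support}
        frequent |= level
        k += 1
    return {s: round(support(s), 2) for s in frequent}

# Toy dataset in the spirit of the case study (hypothetical transactions).
T = [{"U2", "C2", "JO"}, {"U2", "C2", "JO"}, {"U1", "C1", "FS"}, {"U2", "C1", "JO"}]
print(apriori(T, min_support=0.5))
```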

3.5 Support and confidence (Han and Kamber Citation2006)

Support

It is the frequency with which the itemset occurs in the dataset and is denoted by S. For a given association rule $A\Rightarrow B$,
$$\text{Support}(S)=\frac{\text{Number of transactions containing both }A\text{ and }B}{\text{Total number of transactions}}$$

Confidence

It is the measure of the accuracy of the rule and is denoted by C.

For a given association rule $A\Rightarrow B$,
$$\text{Confidence}(C)=\frac{\text{Support}(A\cup B)}{\text{Support}(A)}=\frac{\text{Number of transactions containing both }A\text{ and }B}{\text{Number of transactions containing }A}$$

In other words, it is the conditional probability P(B|A), which indicates the degree of correlation in the dataset.
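A direct translation of these two definitions into Python could look as follows (an illustrative sketch; the transactions are hypothetical).

```python
def support(transactions, itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Confidence of the rule antecedent => consequent: support of the union
    # divided by the support of the antecedent (an estimate of P(B|A)).
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

T = [{"U2", "C2", "JO"}, {"U2", "C2", "JO"}, {"U1", "C1", "FS"}, {"U2", "C1", "JO"}]
print(support(T, {"U2", "C2"}))             # 0.5
print(confidence(T, {"U2", "C2"}, {"JO"}))  # 1.0
```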

4 Proposed algorithm for association rule

In 1994, Agrawal and Srikant gave a seminal algorithm, called Apriori, for mining frequent itemsets for association rules. As the name suggests, prior knowledge of frequent itemsets is used for the mining of further frequent itemsets. It is an iterative, level-wise search in which the previous itemsets are used to explore the later itemsets.

Along the same lines, a new fuzzy-based algorithm is proposed by introducing Renyi's entropy. The steps of the proposed algorithm are as follows:

Step 1: Structure the given data set into tabular form with transaction IDs and other details of the item sets in the transaction.

Step 2: Prepare 1-itemset by collecting the count of each item from the database.

Step 3: Separate the itemsets that satisfy the minimum support (to be defined by the domain experts); this set is denoted by L1.

Step 4: Use L1 to find L2, the set of frequent 2-itemsets; L2 is then used to find L3, and so on.

Step 5: Verify the Apriori property, i.e. all non-empty subsets of a frequent itemset must also be frequent.

Step 6: Set up association rules from the frequent itemsets by calculating the confidence C as proposed by Han and Kamber (Citation2006).

Step 7: Calculate Renyi’s entropy against confidence C for each association rule given by the relation:

$$H=\frac{1}{1-\alpha}\log_2\left(C^{\alpha}+(1-C)^{\alpha}\right);\qquad \alpha>0,\ \alpha\neq 1,$$

where, C is the value of confidence for the association rule.

Step 8: Select the association rule with minimum entropy as the strongest association rule.
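A minimal Python sketch of Steps 6-8 is given below (illustrative only; the rule names and confidence values are hypothetical). It applies the entropy relation of Step 7 to each rule's confidence and selects the minimum-entropy rule as in Step 8.

```python
import math

def confidence_entropy(C, alpha):
    # Step 7: Renyi entropy of order alpha applied to the confidence C of a rule.
    if not (0.0 <= C <= 1.0) or alpha <= 0 or alpha == 1:
        raise ValueError("require 0 <= C <= 1, alpha > 0 and alpha != 1")
    return math.log2(C ** alpha + (1 - C) ** alpha) / (1 - alpha)

def strongest_rule(rule_confidences, alpha=2.0):
    # Step 8: the strongest rule is the one whose confidence has minimum entropy.
    return min(rule_confidences,
               key=lambda r: confidence_entropy(rule_confidences[r], alpha))

# Hypothetical confidences of three candidate association rules.
rules = {"{U2,C2}=>JO": 0.9, "{U1,C1}=>FS": 0.6, "{U3,C3}=>NO": 0.7}
print(strongest_rule(rules, alpha=2.0))  # '{U2,C2}=>JO'
```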

In the next section, the algorithm is demonstrated by means of a case study.

5 Case study

Khare et al. (Citation2009) discussed an algorithm for fuzzy association rules. The algorithm proposed in Section 4 is now demonstrated by means of an illustrative example.

Consider a dataset of 60 students studying at four universities, namely $U_1$, $U_2$, $U_3$ and $U_4$, each pursuing one of the courses $C_1$, $C_2$ and $C_3$. Association rules are generated that indicate how likely a student of a particular university pursuing a specific course is to take preparatory classes for further studies (FS), for a job-oriented course (JO), or coaching (NO). Let
$$\text{University}=\{U_1,U_2,U_3,U_4\};\quad \text{Course}=\{C_1,C_2,C_3\};\quad \text{Preparatory Classes}=\{FS,JO,NO\}.$$

Step 1: Set up the dataset in tabular form with student IDs and other transaction details, given in the Appendix under Table A.1.

Suppose that the maximum threshold value is λ = 4, which means that the dimension of an attribute will not be more than 4.

In Table A.1, all the attributes (items) satisfy the threshold range.

The minimum support for the k-itemsets is chosen arbitrarily. Assume that for $k = 1, 2$ and $3$, the supports are $\delta_1 = 3$, $\delta_2 = 1.5$ and $\delta_3 = 0.5$ respectively. The itemsets are calculated for these values of $\delta_k$, as explained in the following steps.

Step 2: Write the given dataset in 1-itemset fuzzy form as explained in detail in the Appendix under –A.4.

Table 2 L2 Score of students 2-itemset.

Step 3: Separate the 1-itemsets that satisfy minimum support, denoted by L1.

For $\delta_1 = 3$, the values of $L_1$ for the 1-itemsets are given in Table 1:

Table 1 L1 Score of 1-itemset.

Step 4: Using the L1 scores, calculate the L2 scores for the 2-itemsets, as given in the Appendix under Tables A.5–A.11.

Table 5 Confidence of each association rule.

For $\delta_2 = 1.5$, the values of $L_2$ for the 2-itemsets are given in Tables 2 and 3:

Table 3 L2 Score of students 2-itemset.

The following 2-itemsets are not considered for further processing: {U1,C2}, {U1,C3}, {U1,NO}, {U2,C1}, {U2,C3}, {U2,NO}, {U3,C1}, {U3,C2}, {U3,NO}, {U4,C1}, {U4,C2}, {U4,C3}, {U4,FS}, {U4,JO}, {U4,NO}, {C1,JO}, {C1,NO}, {C2,FS}, {C2,NO}, {C3,JO}.

Step 5: Using L2, calculate L3 for the 3-itemsets, as given in the Appendix under Table A.12.

For $\delta_3 = 0.5$, the values of $L_3$ for the 3-itemsets are given in Table 4:

Table 4 L3 Score of 3-itemset.

Step 6: Using the definitions in Section 3.5, calculate the confidence of each association rule, as given in Table 5:

Step 7: Calculate Renyi's entropy measure for different values of the parameter α against each association rule, as given in Table 6:
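The sweep below (a sketch with hypothetical confidence values, not the values of Table 5) illustrates how Step 7 behaves: the entropy of each rule's confidence is evaluated for several values of α, and the rule with the highest confidence remains the minimum-entropy rule throughout.

```python
import math

def H(C, alpha):
    # Renyi entropy of order alpha of a rule confidence C (Step 7).
    return math.log2(C ** alpha + (1 - C) ** alpha) / (1 - alpha)

# Hypothetical confidences for three candidate rules of the case study.
confidences = {"I": 0.75, "II": 0.95, "III": 0.70}
for alpha in (0.5, 2.0, 3.0):
    entropies = {rule: round(H(c, alpha), 3) for rule, c in confidences.items()}
    best = min(entropies, key=entropies.get)
    print(f"alpha={alpha}: {entropies}  -> minimum entropy: rule {best}")
```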

6 Result and discussion

From Table 6, it is observed that association rule no. II ($\{U_2,C_2\}\Rightarrow JO$) has the maximum confidence and the minimum entropy for different values of the parameter α. Thus, rule no. II is the strongest association rule of all and gives the decision value for the given situation. Therefore, a student studying at university U2 and pursuing course C2 will take preparatory classes for a job-oriented course (JO).

Table 6 Renyi’s entropy measure of association rules for different values of the parameter α.

7 Conclusion

The main focus of this research work is to propose a new data mining algorithm comprising a hybrid of fuzzy sets and a fuzzy entropy measure. Renyi's entropy of order α has been used to formulate the new algorithm. A two-layer fuzzy data mining algorithm for association rules with a fuzzy entropy measure has been proposed: the support and confidence of the dataset are calculated in the first layer, and the entropy of the confidence is then calculated to obtain the most promising results. Here, fuzzy-based association rules have been framed to identify relations between universities and the likely courses. The attributes in the database have been treated as fuzzy numbers. From the results, it is concluded that fuzzy information-based algorithms provide more promising results. The proposed algorithm can also be applied to other decision-making problems to identify interesting relations in itemsets.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

References

  • Agrawal R, Imielinski T, Swami AN. 1993. Mining association rule between sets of items in large database. Proceedings of the ACM SIGMOD International Conference on Management of Data, p. 207–216.
  • Agrawal R, Srikant R. 1994. Fast algorithm for mining association rules. Proceedings of 20th International Conference on Very Large Databases, p. 487–499.
  • Baralis E, Garza P. 2002. A lazy approach to pruning classification rules. Proceedings of the IEEE International Conference on Data Mining, p. 35–42.
  • Buckles BP, Petry FE. 1982. A fuzzy representation of data for relational databases. Fuzzy Sets Syst. 7:213–226.
  • Clark P, Matwin S. 1993. Using qualitative models to guide inductive learning. International Conference on Machine Learning, p. 49–56.
  • Delgado M, Gonzalez A. 1993. An inductive learning procedure to identify fuzzy systems. Fuzzy Sets Syst. 55:121–132.
  • Dubois D, Hüllermeier E, Prade H. 2006. A systematic approach to the assessment of fuzzy association rules. Data Min Knowl Disc. 13:167–192.
  • Frawley WJ, Piatetsky-Shapiro G, Matheus CJ. 1992. Knowledge discovery in databases: an overview. AI Magazine. 13:57–70.
  • Fritzke B. 1996. Growing Self-Organizing Networks- Why?. ESANN’ 96: European Symposium on Artificial Neural Networks, p. 61–72.
  • Fu AWC, Wong MH, Sze SC, Wong WC, Wong WL, Yu WK. 1998. Finding fuzzy sets for the mining of fuzzy association rules for numerical attributes. International Symposium on Intelligent Data Engineering and Learning (Ideal ’98), Hong Kong, p. 263–268.
  • Fukuda T, Morimoto Y, Morishita S, Tokuyama T. 1996. Data mining using two-dimensional optimized association rules: scheme, algorithms and visualization. SIGMOD Rec. 25:13–23.
  • Fukuda T, Morimoto Y, Morishita S, Tokuyama T. 1999. Mining optimized association rules for numeric attributes. J Comput Syst Sci. 58:1–12.
  • Gupta S, Mamtora R. 2014. A survey on association rule mining in market-basket analysis. Int J Inf Comput Technol. 4:409–414.
  • Gupta P, Prince, Kumar V. 2014. Selection of course for the intermediate passed out students by using fuzzy information measure. Int J Appl Eng Res. 9:1331–1336.
  • Hajek P, Holeňa M, Rauch J. 2010. The GUHA method and its meaning for data mining. J Comput Syst Sci. 76:34–48.
  • Hand DJ, Smyth P, Mannila H. 2001. Principles of data mining. Cambridge: MIT Press.
  • Han J, Kamber M. 2006. Data mining: concepts and techniques. Amsterdam, Boston, Elsevier, San Francisco, CA: Morgan Kaufmann.
  • Han J, Pei J, Yin Y. 2000. Mining frequent patterns without candidate generation. In Proc. ACM-SIGMOD, p. 1–12.
  • Hong TP, Tseng SS. 1997. A generalized version space learning algorithm for noisy and uncertain data. IEEE Trans Knowl Data Eng. 9:336–340.
  • Hong TP, Kuo CS, Chi SC, Wang SL. 2000. Mining fuzzy rules from quantitative data based on the AprioriTid algorithm. Proceedings of the ACM Symposium on Applied Computing (SAC 2000), Como, p. 534–536.
  • Intan R. 2006. An algorithm for generating single dimensional fuzzy association rule mining. Jurnal Informatika. 7:61–66.
  • Jianjiang L, Baowen X, Hongji Y. 2003. A classification method of fuzzy association rules. IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, p. 8–10.
  • Kacprzyk J, Yager RR, Zadrozny S. 2000. A fuzzy logic based approach to linguistic summaries of databases. Int J Appl Math Comput Sci. 10:813–834.
  • Khare N, Adlakha N, Paradasani KR. 2009. An algorithm for mining multidimensional fuzzy association rules. Int J Comput Sci Inf Secur. 5:72–76.
  • Klemetinen L, Mannila H, Ronkainen P. 1994. Finding interesting rules from large sets of discovered association rules. Proceedings of the International Conference on Information and Knowledge Management, Gaithersburg, USA, p. 401–407.
  • Klir GJ, Yuan B. 2015. Fuzzy sets and fuzzy logics: theory and applications. Noida: Pearson Education India.
  • Kuok CM, Fu A, Wong MH. 1998. Mining fuzzy association rules in databases. SIGMOD Rec. 27:41–46.
  • Lee JH, Kwang HL. 1997. An extension of association rules using fuzzy sets. Seventh International Fuzzy Systems Association World Congress.
  • Li W, Han J, Pei J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. Proc. IEEE International Conference on Data Mining, p. 369–376.
  • Liu B, Hsu W, Ma Y. 1998. Integrating classification and association rule mining. Proc. Of 4th International Conference on Knowledge Discovering and Data Mining KDD ’98, p. 80–86.
  • Luca AD, Termini S. 1972. A definition of non-probabilistic entropy in the setting of fuzzy set theory. Inf Control. 20:301–312.
  • Novak V, Perfilieva I, Dvořák A, Chen G, Wei Q, Yan P. 2008. Mining pure linguistic associations from numerical data. Int J Approx Reason. 48:4–22.
  • Oladipupo OO, Oyelade OJ, Aborisade DO. 2012. Application of fuzzy association rule mining for analyzing students academic performance. Int J Comput Sci Issue. 9:216–223.
  • Pawlak Z. 1982. Rough sets. Int J Comput Inf Sci. 11:341–356.
  • Pazzani MJ, Mani S, Shankle WR. 1997. Beyond concise and colorful: learning intelligible rules. International Conference on Knowledge Discovery and Data Mining, p. 235–238.
  • Raju KVSVN, Majumdar AK. 1988. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Trans Database Syst. 13:129–166.
  • Renyi A. 1960. On measures of entropy and information. Proc. 4th Berkeley Symp. Math. Stat. and Probability, vol. 1, p. 547–561.
  • Shannon CE. 1948. A mathematical theory of communication. Bell Syst Tech J. 27:379–423.
  • Sharmila S, Vijayarani S. 2019. Comparative analysis of fuzzy association rule mining algorithms. Int J Sci Technol Res. 8:991–995.
  • Thiruvady DR. 2003. Mining negative rules in large databases using GRD [MSc thesis]. School of Computer Science and Software Engineering, Monash University.
  • Wang X, Chen B, Qian G, Ye F. 2000. On the optimization of fuzzy decision trees. Fuzzy Sets Syst. 112:117–125.
  • Yager R. 1982. A new approach to the summarization of data. Inf Sci. 28:69–86.
  • Yoda K, Fukuda T, Morimoto Y, Morishita S, Tokuyama T. 1997. Computing optimized rectilinear regions for association rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, p. 96–103.
  • Yuan Y, Shaw MJ. 1995. Induction of fuzzy decision trees. Fuzzy Sets Systems. 69:125–139.
  • Zadeh LA. 1965. Fuzzy sets. Inf Control. 8:338–353.
  • Zadeh LA. 1968. Probability measures of fuzzy events. J Math Anal Appl. 23:421–427.
  • Zhao KQ. 1989. Set pair analysis–a new concept and new systematic analysis method. Proceedings of the National Conference on System Theory and Regional Planning, Baotou, p. 87–91.
  • Zhao KQ. 2000. Set pair analysis and its preliminary application. Hangzhou: Zhejiang Science and Technology Press.
  • Zimmermann HJ. 1991. Fuzzy set theory and its applications. Dordrecht: Kluwer Academic Publishers.

Appendix

Table A.1 Transaction details of the 60 students with student ID.

Table A.2 1-itemset of universities of the given dataset of students.

Table A.3 1-itemset of the courses of the given dataset of students.

Table A.4 1-itemset of the preparatory classes of the given dataset of students.

Table A.5 2-itemset of students using L1.

Table A.6 2-itemset of students using L1.

Table A.7 2-itemset of students using L1.

Table A.8 2-itemset of students using L1.

Table A.9 2-itemset of students using L1.

Table A.10 2-itemset of students using L1.

Table A.11 2-itemset of students using L1.

Table A.12 3-itemset of students using L2.