




Knowledge Discovery in Databases Using Formal Concept Analysis Uta Priss is assistant professor in the School of Library and Information Science, Indiana University, 1320 E 10th Street, Bloomington, IN 474053907. She can be reached by email at upriss@indiana.edu Formal concept analysis (FCA) is a mathematical theory of data analysis that has been used for numerous applications in fields such as social science, civil engineering, planning, psychology and linguistics. FCA supports usercentered navigation of large amounts of data and the exploration of implications, such as "if A is a cat then A can purr," that are implicit in the data. In knowledge discovery in databases (KDD), association rules are often more important than implications. As an example, consider an introductory programming course taught to graduate students. An instructor might want to know what kinds of prerequisite knowledge the students have and whether that affects their learning ability. Association rules are statements of the type "65% of all the students in this course who have a computer at home receive an A in the course." In contrast to implications, which are valid for 100% of a group or population, association rules are valid only for some subset of a population and can be trusted only to a certain degree. In the programming class example, the instructor might wonder whether she should recommend to students in the class to practice on a computer at home or whether the connection between owning a computer and receiving a good grade is irrelevant. Statistical methods can answer such questions, but by using KDD methods the instructor can ask more generally which rules are trustworthy about students in this class based on data from previous semesters. The instructor needs to decide on the levels of support and confidence that she requires for being able to trust a rule. Let's assume in this example that 61% of all students in the class own computers and 40% of all students own computers and receive grades of A. Therefore, 65% (i.e., 40/61) of the students with computers get As as stated above. The rule in consideration is "owning a computer has a positive impact for the grade in this class." The support of the rule is 40% or the percentage of students that both own computers and receive grades of A. Obviously, if only very few students – maybe only one or two students – own computers and receive As, there would not be much support for the rule. The confidence of the rule is 65%. This is the percentage of students who own computers and receive As divided by the percentage of students who own computers. If, for example, 10 students own computers and receive As, and if only 11 students in total own computers then the rule has a higher confidence level compared to a situation where 20 students own computers. Requiring higher levels of minimum support and confidence reduces the number of association rules that are considered trustworthy. In the programming course example, if the instructor decides that all rules are acceptable that have a minimum of 30% support and 60% confidence, then she would recommended that the students buy computers because that rule meets these criteria. Users can specify a desired minimum support and confidence level to focus on the most relevant rules. But even for high levels of minimum support and confidence the numbers of trustworthy rules may be large, thus requiring efficient computational algorithms and sophisticated software that presents the results to users in a format that is easy to navigate and explore. FCA provides such algorithms and software. Concept Lattices The example that follows contains fictitious data about the performance of students on an exam in a programming course taught in a social science department. The example is included here only for the purpose of explaining the use of FCA in KDD. It is not intended to draw any conclusions about programming skills or gender issues from this example. In FCA, data sets consist of a set of instances or objects, a set of attributes or characteristics and a relation between them, which identifies which attributes apply to which instances. The instances in the example are students and are represented as percentages of a total number of 30 students. Table 1 shows the original data set as a contingency table. The attributes are "grade A," "grade B," "grade C," "male" and "female." In the FCA modeling, these attributes need not be subdivided into two sets "gender" and "grade" as indicated in Table 1. They are considered five separate attributes. Figure 1 shows a type of diagram called a concept lattice that represents the data from Table 1. Each concept is represented as a node in the diagram and contains the percentage of the students to which it refers. (Only the white boxes represent concepts. The shaded boxes belong to rules and will be explained in the next section.) The top concept refers to all students (100%); the bottom concept to none (0%). The attributes are written above the concepts to which they apply and are inherited by subconcepts. For example, the concept directly under "A" refers to 57% of the students. The concept below that one is connected to "A" and "female." It represents the 33% of the students that are female and received the grade "A." It should be noted that while the percentages of "female Astudents" (33%) and "male Astudents" (24%) add up to "Astudents" (57%) that is not always the case in Figure 1. The sum of the numbers in the second row (57%, 70%, 30%, 26%) is 183 because, for example, students that are "Astudents" and "female" are counted twice. FCA software usually provides an alternative display that shows the exclusive counts for each concept if users prefer that. The diagrammatic representation visualizes conceptual relationships that are implicit in the data. For example, the fact that no students are male and female simultaneously can be read from the diagram by identifying the highest concept that is both under "female" and "male." This is the bottom concept, which refers to 0% of the students. In other words, the concepts for "male" and "female" are exclusive. The topdown direction in the lattice corresponds to conceptual inclusion. For example, the concept that refers to the grade "C" is under the concept "female" but not under "male." This means that all Cstudents (which are 17% of all students) are female in this example. Formally, this called an "implication." Being a Cstudent in this example implies that the student is female. Association Rules in Concept Lattices In this specific example, the conceptual inclusion relationships may not be surprising because it might have been guessed before drawing the diagram that a concept for "female Astudents" would be a subconcept of "female students" and of "Astudents." The only significant implication is that in this example, all Cstudents are female. But how relevant is that implication given the fact that only 30% of all students are male? In other words, if there were more male students would some of them have C's? There may be no unique answer to this question but FCA software tools provide a means for users to explore such questions and decide for themselves what is accurate or relevant. As mentioned above, the idea is to consider not only implications but also association rules. While each implication rule follows a path that leads up in the diagram, an association rule follows a path that leads down. For example, the two paths leading down from "male" point to "male A student" and "male B student." But the numbers indicate that if a student is male it is much more likely that he is an Astudent than that he is a Bstudent because 30% of all students are male, 24% students are male Astudents but only 6% are male Bstudents. The confidence level for a rule "male students are Astudents" is 80% (calculated as 24/30). The confidence level for "male students are Bstudents" is only 20%. In Figure 1 these confidence levels are indicated in the shaded boxes which are attached to lines in the diagram. Each line represents a simple association rule. It should be noted that these are only the simplest possible rules. Association rules can be more complicated and involve several concepts and longer paths. Coming back to the question of whether male students would receive C's if there were more male students, the levels of confidence and support should be considered. The percentage of students to which a concept refers is considered the support for that concept. In the example, the support for "Cstudent" is 17% and the support for male student is 30%. Both could be considered too low to draw any conclusions. Even though the confidence of "male students are Astudents" is high (80%), this rule could be considered irrelevant because the support for "male student" is too low. These types of considerations are common in statistical analysis where it is important that populations are large enough to be significant. The advantage of FCA is that a user can explore the data interactively by changing levels of support and confidence and then studying the resulting graphical representations. Table 2 shows an example which adds two further attributes: "SocSci," which stands for students who have mainly a social science or humanities background, and "CoSci," which stands for students who have had some previous computing experience. As an example, a user can reduce the complexity by selecting a 30% minimum level of support for concepts and a 50% minimum level of confidence for rules. Figure 2 shows the resulting diagram. It does not represent a complete lattice because only concepts that apply to at least 30% of the students are included. Only the lines that represent rules with confidence level over 50% are labeled. In this case, no rules about male students can be stated because, although there are 30% male students, all of the subconcepts have less support and are omitted. The following rules are available at the minimum confidence level of 50%: SocSci students are female (83%), CoSci students are female (61%), Astudents are female (58%), Astudents are CoSci (58%), CoSci students are Astudents (56%). In general, altering the requirements for the level of confidence and support can thus alter the selection of association rules. It is for the person who analyzes the data to decide which level of confidence and support is sufficient. This approach supports the view of KDD as an interactive process between a user and a database. A KDD tool should thus not be an autonomous retrieval system that provides results without explanation or without facilitating interactive exploration. Using FCA tools a user can explore the data interactively by zooming into more selective areas of the conceptual hierarchy and by modifying the support and confidence levels to get a feeling for the relationships among the concepts and the rules and implications among the attributes. It should probably be remarked that the example given in this article is quite simple. Real applications require further statistical, KDD and FCA strategies and algorithms that are beyond the scope of this article. For Further Reading About FCA Ganter, B. and Wille, R. (1999). Formal concept analysis: Mathematical foundations. Springer, Heidelberg, Germany: Springer. An FCA website: http://php.indiana.edu/~upriss/fca/fca.html About FCA and KDD Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices. Journal of Information Systems, 24(1), 2546. 


