        CS439: Introduction to Data Science Fall 2024 
         
        Problem Set 1 
         
        Due: 11:59pm Friday, October 11, 2024 
         
        Late Policy: The homework is due on 10/11 (Friday) at 11:59pm. We will release the solutions
        of the homework on Canvas on 10/16 (Wednesday) at 11:59pm. If your homework is submitted to
        Canvas before 10/11 11:59pm, there will be no late penalty. If you submit to Canvas after 10/11
        11:59pm and before 10/16 11:59pm (i.e., before we release the solution), your score will be
        multiplied by 0.9^k, where k is the number of days of late submission. For example, if you
        submit on 10/14 (14 - 11 = 3 days late) and your original score is 80, then your final score
        will be 80 * 0.9^3 = 58.32. If you submit to Canvas after 10/16 11:59pm (i.e., after we
        release the solution), then you will earn no score for the homework. 
         
        General Instructions 
         
        Submission instructions: These questions require thought but do not require long answers.
        Please be as concise as possible. You should submit your answers as a writeup in PDF format.
        For the questions that require coding, write your code for each question in a single source
        code file named after the question number (e.g., question_1.java or question_1.py). Name your
        PDF answer file after your name and NetID (i.e., Firstname-Lastname-NetID.pdf), put the PDF
        file and all the code files in a folder named after your name and NetID, compress the folder
        as a zip file (e.g., Firstname-Lastname-NetID.zip), and submit the zip file via Canvas. 
         
        For the answer writeup PDF file, we have provided both a Word template and a LaTeX template.
        After you finish writing, save the file as a PDF, and submit both the original file (Word or
        LaTeX) and the PDF file. 
         
        Questions 
         
        1. Map-Reduce (35 pts) 
         
        Write a MapReduce program in Hadoop that implements a simple “People You Might Know” 
        social network friendship recommendation algorithm. The key idea is that if two people have a 
        lot of mutual friends, then the system should recommend that they connect with each other. 
         
        Input: Use the provided input file hw1q1.zip. 
         
        The input file contains the adjacency list and has multiple lines in the following format: 
        <User><TAB><Friends> 
         Here, <User> is a unique integer ID corresponding to a unique user, and <Friends> is a
         comma-separated list of unique IDs corresponding to the friends of the user with the unique
        ID <User>. Note that the friendships are mutual (i.e., edges are undirected): if A is a friend
        of B, then B is also a friend of A. The data provided is consistent with that rule, as there
        is an explicit entry for each side of each edge. 
         
        Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends
        N = 10 users who are not already friends with U but share the largest number of mutual
        friends with U. 
         
        Output: The output should contain one line per user in the following format: 
         
        <User><TAB><Recommendations> 
         
        where <User> is a unique ID corresponding to a user and <Recommendations> is a
        comma-separated list of unique IDs corresponding to the algorithm’s recommendation of people that 
        <User> might know, ordered by decreasing number of mutual friends. Even if a user has 
        fewer than 10 second-degree friends, output all of them in decreasing order of the number of 
        mutual friends. If a user has no friends, you can provide an empty list of recommendations. If 
        there are multiple users with the same number of mutual friends, ties are broken by ordering 
        them in a numerically ascending order of their user IDs. 
         
        Also, please provide a description of how you are going to use MapReduce jobs to solve this 
        problem. We only need a very high-level description of your strategy to tackle this problem. 
         
        Note: It is possible to solve this question with a single MapReduce job. But if your solution 
        requires multiple MapReduce jobs, then that is fine too. 
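
        For reference, here is a minimal sketch of how the single-job approach could be structured in
        Hadoop Java. The class names, the "-"-prefixed marker that flags existing friendships, and the
        edge-case handling are illustrative assumptions, not requirements of the assignment:

        import java.io.IOException;
        import java.util.*;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class Question1 {
          // For a user U with friends f1..fn, emit two kinds of records:
          //   (fi, "fj")   -- fi and fj share the mutual friend U (a candidate pair)
          //   (U,  "-fi")  -- U and fi are already friends (an exclusion marker)
          public static class PymkMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
              String[] parts = value.toString().split("\t");
              if (parts.length < 2 || parts[1].isEmpty()) return;  // user with no friends
              long user = Long.parseLong(parts[0]);
              String[] friends = parts[1].split(",");
              for (String f : friends)
                ctx.write(new LongWritable(user), new Text("-" + f));
              for (int i = 0; i < friends.length; i++)
                for (int j = i + 1; j < friends.length; j++) {
                  ctx.write(new LongWritable(Long.parseLong(friends[i])), new Text(friends[j]));
                  ctx.write(new LongWritable(Long.parseLong(friends[j])), new Text(friends[i]));
                }
            }
          }

          // Count mutual friends per candidate, drop existing friends, then keep the
          // top 10 ordered by count (descending) and user ID (ascending) on ties.
          public static class PymkReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void reduce(LongWritable user, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
              Map<Long, Integer> counts = new HashMap<>();
              Set<Long> existing = new HashSet<>();
              for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("-")) existing.add(Long.parseLong(s.substring(1)));
                else counts.merge(Long.parseLong(s), 1, Integer::sum);
              }
              counts.keySet().removeAll(existing);
              List<Long> ranked = new ArrayList<>(counts.keySet());
              ranked.sort((a, b) -> {
                int c = counts.get(b) - counts.get(a);   // more mutual friends first
                return c != 0 ? c : Long.compare(a, b);  // ties: ascending user ID
              });
              StringBuilder out = new StringBuilder();
              for (int i = 0; i < Math.min(10, ranked.size()); i++) {
                if (i > 0) out.append(',');
                out.append(ranked.get(i));
              }
              ctx.write(user, new Text(out.toString()));  // key and value are tab-separated
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "people-you-might-know");
            job.setJarByClass(Question1.class);
            job.setMapperClass(PymkMapper.class);
            job.setReducerClass(PymkReducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

        Note that this sketch skips users whose friend list is empty; a complete solution would still
        emit an output line with an empty recommendation list for them, as the problem requires.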
         
        What to submit: 
         
        (i) The source code as a single source code file named as the question number (e.g., 
        question_1.java). 
         
        (ii) Include in your writeup a short paragraph describing your algorithm to tackle this problem. 
         
        (iii) Include in your writeup the recommendations for the users with the following user IDs: 
        924, 8941, 8942, **19, **20, **21, **22, 99**, 9992, 9993. 
         
         
        2. Association Rules (35 pts) 
         
        Association Rules are frequently used for Market Basket Analysis (MBA) by retailers to
        understand the purchase behavior of their customers. This information can then be used for
        many different purposes, such as cross-selling and up-selling of products, sales promotions,
        loyalty programs, store design, discount plans, and many others. 
         
        Evaluation of item sets: Once you have found the frequent itemsets of a dataset, you need to 
        choose a subset of them as your recommendations. Commonly used metrics for measuring the
        significance and interest of rules when selecting them for recommendations are: 
         
        2a. Confidence (denoted as conf(A → B)): Confidence is defined as the probability of 
        occurrence of B in the basket if the basket already contains A: 
         
        conf(A → B) = Pr(B|A), 
         
        where Pr(B|A) is the conditional probability of finding item set B given that item set A is 
        present. 
         
        2b. Lift (denoted as lift(A → B)): Lift measures how much more “A and B occur together” than
        “what would be expected if A and B were statistically independent”:
         
        lift(A → B) = conf(A → B) / S(B),
         
        where S(B) = Support(B) / N is the fraction of transactions (baskets) containing B, and N is
        the total number of transactions (baskets). 
         
        2c. Conviction (denoted as conv(A → B)): it compares the “probability that A appears without
        B if they were independent” with the “actual frequency of the appearance of A without B”:
         
        conv(A → B) = (1 − S(B)) / (1 − conf(A → B)). 
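
        To make the three metrics concrete, here is a small worked example with made-up numbers:
        suppose N = 100 baskets, item set A appears in 20 of them, B appears in 50, and A and B
        appear together in 10. Then conf(A → B) = (10/100)/(20/100) = 0.5 and S(B) = 50/100 = 0.5,
        so lift(A → B) = 0.5/0.5 = 1 and conv(A → B) = (1 − 0.5)/(1 − 0.5) = 1, exactly the values
        expected when A and B are statistically independent.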
         
        (a) [5 pts] 
         
        A drawback of using confidence is that it ignores Pr(B). Why is this a drawback? Explain why
        lift and conviction do not suffer from this drawback. 
         
        (b) [5 pts] 
         
        A measure is symmetrical if measure(A → B) = measure(B → A). Which of the measures 
        presented here are symmetrical? For each measure, please provide either a proof that the 
        measure is symmetrical, or a counterexample that shows the measure is not symmetrical. 
         
        (c) [5 pts] 
         A measure is desirable if its value is maximal for rules that hold 100% of the time (such rules are 
        called perfect implications). This makes it easy to identify the best rules. Which of the above 
        measures have this property? Explain why. 
         
         
        Product Recommendations: The action or practice of selling additional products or services to
        existing customers is called cross-selling. Giving product recommendations is one example of
        cross-selling that is frequently used by online retailers. One simple method to give product
        recommendations is to recommend products that are frequently browsed together by customers. 
         
        Suppose we want to recommend new products to a customer based on the products they have
        already browsed on the website. Write a program using the A-priori algorithm to find products
        that are frequently browsed together. Fix the support to s = 100 (i.e., product pairs need to
        occur together at least 100 times to be considered frequent) and find itemsets of size 2 and 3. 
         
        Use the provided browsing behavior dataset browsing.txt. Each line represents a browsing 
        session of a customer. On each line, each string of 8 characters represents the id of an item 
        browsed during that session. The items are separated by spaces. 
         
        Note: for the following questions (d) and (e), the writeup will require a specific rule ordering 
        but the program need not sort the output. 
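
        As a reference point, here is a minimal sketch of the two A-priori passes needed for part (d);
        the file name browsing.txt comes from the handout, while the class name and output formatting
        are illustrative assumptions. A third pass over the frequent pairs would handle the triples in
        part (e) analogously:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.*;

        public class Question2 {
          static final int SUPPORT = 100;

          public static void main(String[] args) throws IOException {
            // Pass 1: count the number of sessions each individual item appears in.
            Map<String, Integer> itemCounts = new HashMap<>();
            try (BufferedReader br = new BufferedReader(new FileReader("browsing.txt"))) {
              String line;
              while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) continue;     // skip blank lines
                for (String item : new TreeSet<>(Arrays.asList(line.trim().split("\\s+"))))
                  itemCounts.merge(item, 1, Integer::sum);
              }
            }
            // A-priori pruning: a pair can be frequent only if both of its items are.
            Set<String> frequentItems = new HashSet<>();
            for (Map.Entry<String, Integer> e : itemCounts.entrySet())
              if (e.getValue() >= SUPPORT) frequentItems.add(e.getKey());

            // Pass 2: count candidate pairs whose members are both frequent.
            Map<String, Integer> pairCounts = new HashMap<>();
            try (BufferedReader br = new BufferedReader(new FileReader("browsing.txt"))) {
              String line;
              while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                TreeSet<String> session = new TreeSet<>(Arrays.asList(line.trim().split("\\s+")));
                session.retainAll(frequentItems);        // drop infrequent items up front
                List<String> items = new ArrayList<>(session);
                for (int i = 0; i < items.size(); i++)
                  for (int j = i + 1; j < items.size(); j++)
                    pairCounts.merge(items.get(i) + " " + items.get(j), 1, Integer::sum);
              }
            }
            // Confidence of X => Y is support({X, Y}) / support({X}).
            for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
              if (e.getValue() < SUPPORT) continue;      // keep frequent pairs only
              String[] xy = e.getKey().split(" ");
              System.out.printf("%s => %s : %.4f%n", xy[0], xy[1],
                  (double) e.getValue() / itemCounts.get(xy[0]));
              System.out.printf("%s => %s : %.4f%n", xy[1], xy[0],
                  (double) e.getValue() / itemCounts.get(xy[1]));
            }
          }
        }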
         
        (d) [10pts] 
         
        Identify pairs of items (X, Y) such that the support of {X, Y} is at least 100. For all such pairs, 
        compute the confidence scores of the corresponding association rules: X ⇒ Y, Y ⇒ X. Sort the 
        rules in decreasing order of confidence scores and list the top 5 rules in the writeup. Break
        ties, if any, by lexicographically increasing order on the left-hand side of the rule. 
         
        (e) [10pts] 
         
        Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such triples, 
        compute the confidence scores of the corresponding association rules: (X, Y) ⇒ Z, (X, Z) ⇒ Y, 
        and (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in 
        the writeup. Order the left-hand-side pair lexicographically and break ties, if any, by
        lexicographical order of the first, then the second, item in the pair. 
         
        What to submit: 
         
        Include your properly named code file (e.g., question_2.java or question_2.py), and include the 
        answers to the following questions in your writeup: 
         (i) Explanation for 2(a). 
         
        (ii) Proofs and/or counterexamples for 2(b). 
         
        (iii) Explanation for 2(c). 
         
        (iv) Top 5 rules with confidence scores for 2(d). 
         
        (v) Top 5 rules with confidence scores for 2(e). 
         
        3. Locality-Sensitive Hashing (30 pts) 
         
        When simulating a random permutation of rows, as described in Sec. 3.3.5 of the MMDS textbook, 
        we could save a lot of time if we restricted our attention to a randomly chosen k of the n rows, 
        rather than hashing all the row numbers. The downside of doing so is that if none of the k rows 
        contains a 1 in a certain column, then the result of the min-hashing is “don’t know,” i.e., we get 
        no row number as a min-hash value. It would be a mistake to assume that two columns that 
        both min-hash to “don’t know” are likely to be similar. However, if the probability of getting 
        “don’t know” as a min-hash value is small, we can tolerate the situation, and simply ignore such 
        min-hash values when computing the fraction of min-hashes in which two columns agree. 
         
        (a) [10 pts] 
         
        Suppose a column has m 1's and therefore (n − m) 0's. Prove that the probability we get
        “don't know” as the min-hash value for this column is at most ((n − m)/n)^k. 
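
        One possible line of reasoning (a sketch, assuming the k rows are sampled uniformly without
        replacement): “don't know” occurs exactly when all k sampled rows avoid the m rows holding a
        1, so
         
        \Pr[\text{don't know}] = \prod_{i=0}^{k-1} \frac{n-m-i}{n-i} \le \left(\frac{n-m}{n}\right)^k,
         
        where the inequality holds because each factor (n − m − i)/(n − i) is at most (n − m)/n.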
         
        (b) [10 pts] 
         
        Suppose we want the probability of “don't know” to be at most e^(−10). Assuming n and m are
        both very large (but n is much larger than m or k), give a simple approximation to the
        smallest value of k that will assure this probability is at most e^(−10). Hints: (1) You can
        use ((n − m)/n)^k as the exact value of the probability of “don't know.” (2) Remember that
        for large x, (1 − 1/x)^x ≈ 1/e. 
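
        Sketching how the hints combine (under the same reading of the bound as above):
         
        \left(\frac{n-m}{n}\right)^k = \left[\left(1-\frac{m}{n}\right)^{n/m}\right]^{km/n}
        \approx e^{-km/n},
         
        so requiring e^(−km/n) ≤ e^(−10) suggests that the smallest suitable k is approximately
        k ≈ 10n/m.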
         
        (c) [10 pts] 
         
        Note: This question should be considered separate from the previous two parts, in that we are 
        no longer restricting our attention to a randomly chosen subset of the rows. 
         When min-hashing, one might expect that we could estimate the Jaccard similarity without 
        using all possible permutations of rows. For example, we could allow only cyclic permutations,
        i.e., start at a randomly chosen row r, which becomes the first in the order, followed by rows 
        r+1, r+2, and so on, down to the last row, and then continuing with the first row, second row, 
        and so on, down to row r−1. There are only n such permutations if there are n rows. However, 
        these permutations are not sufficient to estimate the Jaccard similarity correctly. 
         
        Give an example of two columns such that the probability (over cyclic permutations only) that 
        their min-hash values agree is not the same as their Jaccard similarity. In your answer, please 
        provide (a) an example of a matrix with two columns (let the two columns correspond to sets 
        denoted by S1 and S2) (b) the Jaccard similarity of S1 and S2, and (c) the probability that a 
        random cyclic permutation yields the same min-hash value for both S1 and S2. 
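
        For intuition, here is one small example of the kind requested (a sketch; many matrices whose
        union rows are unevenly spaced would work). Take n = 3 rows, with S1 = {r1, r2} and S2 = {r1}:
         
              S1  S2
        r1     1   1
        r2     1   0
        r3     0   0
         
        The Jaccard similarity is |S1 ∩ S2| / |S1 ∪ S2| = 1/2. A cyclic permutation makes the two
        min-hash values agree exactly when the first row of S1 ∪ S2 reached from the starting row
        lies in S1 ∩ S2 = {r1}: starting rows r1 and r3 both reach r1 first (agreement), while
        starting row r2 yields min-hash r2 for S1 but r1 for S2 (disagreement). So the agreement
        probability over the 3 cyclic permutations is 2/3, which differs from the Jaccard similarity
        of 1/2.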
         
        What to submit: 
         
        Include the following in your writeup: 
         
        (i) Proof for 3(a) 
         
        (ii) Derivation and final answer for 3(b) 
         
        (iii) Example for 3(c) 
         