This is perhaps one of the classic coding challenges that gained some resonance in 1986, when columnist Jon Bentley asked Donald Knuth to write a program that would find the k most frequent words in a file. Knuth implemented a fast solution using hash tries in an 8-page-long program to illustrate his literate programming technique. Douglas McIlroy of Bell Labs criticized Knuth's solution as not even able to process a full text of the Bible, and replied with a one-liner that is not as quick, but gets the job done:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q

In 1987, a follow-up article was published with yet another solution, this time by a Princeton professor. But it also couldn't even return a result for a single Bible!

Problem description

Original problem description:

Given a text file and an integer k, you are to print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Additional problem clarifications:

Knuth defined a word as a string of Latin letters: [A-Za-z]+
all other characters are ignored
uppercase and lowercase letters are considered equivalent (WoRd == word)
no limit on file size nor word length
distances between consecutive words can be arbitrarily large
the fastest program is the one that uses the least total CPU time (multithreading probably won't help)
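The rules above can be sketched in a few lines of Python. This is only an illustration of the specification, not one of the benchmarked programs; the function name `top_k_words` is made up for this example, and it assumes the whole file fits in memory as a `bytes` object.

```python
import re
from collections import Counter

# A word is a maximal run of Latin letters, as in Knuth's definition.
WORD = re.compile(rb"[A-Za-z]+")

def top_k_words(data: bytes, k: int):
    """Return the k most frequent words as (word, count), most frequent first."""
    # Lowercasing makes WoRd and word count as the same word.
    counts = Counter(match.group().lower() for match in WORD.finditer(data))
    return [(word.decode("ascii"), n) for word, n in counts.most_common(k)]

if __name__ == "__main__":
    sample = b"The cat and THE dog; the end."
    for word, n in top_k_words(sample, 1):
        print(n, word)
```

For a real file, `data` could be read with `open(path, "rb").read()`. Note that every non-letter is a separator, so a text such as "don't" contributes the two words "don" and "t".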

Sample test cases

Test 1: Ulysses by James Joyce concatenated 64 times (96 MB file).

Download Ulysses from Project Gutenberg: wget http://www.gutenberg.org/files/4300/4300-0.txt
Concatenate it 64 times: for i in {1..64}; do cat 4300-0.txt >> ulysses64; done
The most frequent word is “the” with 968832 appearances.

Test 2: Specially generated random text giganovel (around 1 GB).

Python 3 generator script here.
The text contains 148391 distinct words that appear similarly to those in natural languages.
Most frequent words: “e” (11309 appearances) and “ihit” (11290 appearances).

Generality test: arbitrarily large words with arbitrarily large gaps.

Reference implementations

After looking into Rosetta Code for this problem and realizing that many implementations are incredibly slow (slower than the shell script!), I tested a few good implementations here. Below is the performance for ulysses64 together with time complexity:

                                     ulysses64      Time complexity
C++ (prefix trie + heap)             4.145          O((N + k) log k)
Python (Counter)                     10.547         O(N + k log Q)
AWK + sort                           20.606         O(N + Q log Q)
McIlroy (tr + sort + uniq)           43.554         O(N log N)

Can you beat this?
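As a rough illustration of the O(N + k log Q) entry in the complexity table, here is a sketch of heap-based selection. It assumes N is the total number of words and Q the number of distinct words (the table does not define them), and it is not claimed to be what the Python (Counter) reference does internally; `top_k_by_heap` is a hypothetical name.

```python
import heapq

def top_k_by_heap(counts: dict, k: int):
    """Pick the k largest counts by heapifying all Q entries, then popping k times."""
    # Negate the counts so that Python's min-heap yields the largest count first.
    heap = [(-n, word) for word, n in counts.items()]
    heapq.heapify(heap)                    # O(Q)
    result = []
    for _ in range(min(k, len(heap))):     # k pops, O(log Q) each
        negated, word = heapq.heappop(heap)
        result.append((word, -negated))
    return result
```

Counting the words costs O(N), so the total is O(N + Q + k log Q), which is O(N + k log Q) because Q cannot exceed N.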

Testing

Performance will be evaluated on a 2017 13" MacBook Pro with the standard Unix time command ("user" time). If possible, please use modern compilers (e.g., use the latest Haskell version, not the legacy one).

Rankings so far

Timings, including the reference programs:

                                              k=10                  k=100K
                                     ulysses64      giganovel      giganovel
C++ (trie) by ShreevatsaR            0.671          4.227          4.276
C (trie + bins) by Moogie            0.704          9.568          9.459
C (trie + list) by Moogie            0.767          6.051          82.306
C++ (hash trie) by ShreevatsaR       0.788          5.283          5.390
C (trie + sorted list) by Moogie     0.804          7.076          x
Rust (trie) by Anders Kaseorg        0.842          6.932          7.503
J by miles                           1.273          22.365         22.637
C# (trie) by recursive               3.722          25.378         24.771
C++ (trie + heap)                    4.145          42.631         72.138
APL (Dyalog Unicode) by Adám         7.680          x              x
Python (dict) by movatica            9.387          99.118         100.859
Python (Counter)                     10.547         102.822        103.930
Ruby (tally) by daniero              15.139         171.095        171.551
AWK + sort                           20.606         213.366        222.782
McIlroy (tr + sort + uniq)           43.554         715.602        750.420

Cumulative ranking* (%, best possible score: 300):

#     Program                         Score  Generality
 1  C++ (trie) by ShreevatsaR           300     Yes
 2  C++ (hash trie) by ShreevatsaR      368      x
 3  Rust (trie) by Anders Kaseorg       465     Yes
 4  C (trie + bins) by Moogie           552      x
 5  J by miles                         1248     Yes
 6  C# (trie) by recursive             1734      x
 7  C (trie + list) by Moogie          2182      x
 8  C++ (trie + heap)                  3313      x
 9  Python (dict) by movatica          6103     Yes
10  Python (Counter)                   6435     Yes
11  Ruby (tally) by daniero           10316     Yes
12  AWK + sort                        13329     Yes
13  McIlroy (tr + sort + uniq)        40970     Yes

* Sum of time performance relative to the best programs in each of the three tests.
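The footnote can be checked against the timing table. The sketch below assumes the score is the sum, over the three tests, of each program's time as a percentage of the best time in that test, rounded to an integer; that reading reproduces the published scores for the three rows used here. The name `cumulative_scores` is made up for this example.

```python
# Timings in seconds from the table above: (ulysses64 k=10, giganovel k=10, giganovel k=100K).
TIMINGS = {
    "C++ (trie) by ShreevatsaR": (0.671, 4.227, 4.276),
    "C++ (hash trie) by ShreevatsaR": (0.788, 5.283, 5.390),
    "Rust (trie) by Anders Kaseorg": (0.842, 6.932, 7.503),
}

def cumulative_scores(timings):
    """Sum of 100 * time / best_time over all tests, rounded to an integer."""
    best = [min(column) for column in zip(*timings.values())]
    return {
        name: round(sum(100 * t / b for t, b in zip(times, best)))
        for name, times in timings.items()
    }

if __name__ == "__main__":
    for name, score in cumulative_scores(TIMINGS).items():
        print(score, name)
```

The first row holds the best time in every column of the full table, so restricting the computation to these three programs does not change the per-test minimum.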

Best program so far: here (second solution)

fastest-code

— Andriy Makukha

The score is only the Ulysses time? It seems implied, but it is not explicitly said.

— Post Rock Garf Hunter

@SriotchilismO'Zaic, for now, yes. But you shouldn't rely on the first test case, because larger test cases might follow. ulysses64 has the obvious disadvantage of being repetitive: no new words appear after 1/64 of the file. So it's not a very good test case, but it's easy to distribute (or reproduce).

— Andriy Makukha

3

I meant the hidden test cases you were talking about earlier. If you post the hashes now, then when you reveal the actual texts we can be sure that it's fair to the answers and that you aren't king-making. Although I suppose the hash for Ulysses is somewhat useful.

— Post Rock Garf Hunter

1

@tsh That is my understanding: e.g. would be two words, e and g

— Moogie

1

@AndriyMakukha Ah, thanks. Those were just bugs; I fixed them.

— Anders Kaseorg

5

[C]

The following runs in under 1.6 seconds for Test 1 on my 2.8 GHz Xeon W3530. Built using MinGW.org GCC-6.3.0-1 on Windows 7:

It takes two arguments as input (the path to the text file and k, the number of most frequent words to list).

It simply creates a tree branching on the letters of words, then increments a counter at the leaf letters. Then it checks whether the current leaf counter is greater than that of the least frequent word in the list of most frequent words (the list size is the number given on the command line). If so, it promotes the word represented by the leaf letter to be one of the most frequent. This all repeats until no more letters are read in. After that, the list of most frequent words is output via an inefficient iterative search for the most frequent word in the list of most frequent words.

It currently defaults to outputting the processing time, but for consistency with other submissions, disable the TIMING definition in the source code.

Also, I have submitted this from a work computer and have not been able to download the Test 2 text. It should work with Test 2 without modification; however, the MAX_LETTER_INSTANCES value may need to be increased.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// increase this if needing to output more top frequent words
#define MAX_TOP_FREQUENT_WORDS 1000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char mostFrequentWord;
    struct Letter* parent;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0 || k> MAX_TOP_FREQUENT_WORDS)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n");
        printf("NOTE: upto %d most frequent words can be requested\n\n",MAX_TOP_FREQUENT_WORDS);
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
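    fptr = fopen(argv[1], "rb");
    if (fptr == null)
    {
        // bail out early instead of passing a null FILE pointer to fseek/fread below
        printf("Unable to open input file: %s\n", argv[1]);
        return -1;
    }
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    fclose(fptr);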

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pesudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;
    root->mostFrequentWord = false;
    root->count = 0;

    // the next letter to be processed
    Letter* nextLetter = null;

    // store of the top most frequent words
    Letter* topWords[MAX_TOP_FREQUENT_WORDS];

    // initialise the top most frequent words
    for (i = 0; i<k; i++)
    {
        topWords[i]=root;
    }

    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, intialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // ignore this word if already identified as a most frequent word
            if (!currentLetter->mostFrequentWord)
            {
                // update the list of most frequent words
                // by replacing the most infrequent top word if this word is more frequent
                if (currentLetter->count> lowestWordCount)
                {
                    // counts of words already in the list may have grown since the last scan,
                    // so lowestWordCount is only a lower bound: rescan for the true least frequent top word
                    // (unused slots still point at root, whose count is always 0)
                    lowestWordCount = topWords[0]->count;
                    lowestWordIndex = 0;
                    for (i=1;i<k; i++)
                    {
                        if (topWords[i]->count<lowestWordCount)
                        {
                            lowestWordCount = topWords[i]->count;
                            lowestWordIndex = i;
                        }
                    }

                    // replace the least frequent top word only if this word really is more frequent
                    if (currentLetter->count> lowestWordCount)
                    {
                        currentLetter->mostFrequentWord = true;
                        topWords[lowestWordIndex]->mostFrequentWord = false;
                        topWords[lowestWordIndex] = currentLetter;
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    while (k > 0 )
    {
        highestWordCount = 0;
        string[0]=0;
        tmp[0]=0;

        // find next most frequent word
        for (i=0;i<k; i++)
        {
            if (topWords[i]->count>highestWordCount)
            {
                highestWordCount = topWords[i]->count;
                highestWordIndex = i;
            }
        }

        Letter* letter = topWords[highestWordIndex];

        // swap the end top word with the found word and decrement the number of top words
        topWords[highestWordIndex] = topWords[--k];

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

For Test 1, with the top 10 most frequent words and timing enabled, it should print:

 968832 the
 528960 of
 466432 and
 421184 a
 322624 to
 320512 in
 270528 he
 213120 his
 191808 i
 182144 s

 Time Taken: 1.549155 seconds
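The idea described in this answer (a letter tree with a counter at the end of each word, plus a small list of the current top words) can be sketched compactly in Python. This is an illustrative sketch, not a port of the C program: the names `Node`, `spell` and `top_k_trie` are invented here, the input is assumed to be a `bytes` object, and the least frequent list entry is rescanned whenever a candidate might displace it.

```python
from itertools import chain

class Node:
    __slots__ = ("children", "count", "parent", "letter", "in_top")

    def __init__(self, parent=None, letter=""):
        self.children = [None] * 26   # one slot per letter a..z
        self.count = 0                # occurrences of the word ending at this node
        self.parent = parent
        self.letter = letter
        self.in_top = False           # True while the node sits in the top list

def spell(node):
    """Rebuild a word by walking parent links from its last letter up to the root."""
    letters = []
    while node.parent is not None:
        letters.append(node.letter)
        node = node.parent
    return "".join(reversed(letters))

def top_k_trie(data: bytes, k: int):
    if k <= 0:
        return []
    root = Node()
    current = root
    top = [root] * k      # unused slots point at root, whose count stays 0
    lower_bound = 0       # never larger than the smallest count in top
    for byte in chain(data, b" "):        # trailing separator flushes the last word
        c = byte | 0x20                   # fold ASCII letters to lower case
        if 97 <= c <= 122:
            child = current.children[c - 97]
            if child is None:
                child = Node(current, chr(c))
                current.children[c - 97] = child
            current = child
        elif current is not root:
            current.count += 1
            if not current.in_top and current.count > lower_bound:
                weakest = min(range(k), key=lambda i: top[i].count)
                lower_bound = top[weakest].count
                if current.count > lower_bound:
                    top[weakest].in_top = False
                    current.in_top = True
                    top[weakest] = current
            current = root
    found = [node for node in top if node is not root]
    found.sort(key=lambda node: node.count, reverse=True)
    return [(spell(node), node.count) for node in found]
```

Each rescan of the list costs O(k), which is why this approach suits small k and degrades as k grows.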

— Moogie

Impressive! Use of the list supposedly makes it O(Nk) in the worst case, so it runs slower than the reference C++ program for giganovel with k = 100,000. But for k << N it is a clear winner.

— Andriy Makukha

1

@AndriyMakukha Thanks! I was a bit surprised that such a simple implementation yielded such great speed. I could make it better for larger values of k by keeping the list sorted (sorting should not be too expensive, as the list order would change slowly), but that adds complexity and would likely hurt speed for lower values of k.

— Moogie

Yeah, I was surprised too. It might be because the reference program uses a lot of function calls and the compiler fails to optimize them properly.

— Andriy Makukha

Another performance benefit probably comes from the semi-static allocation of the letters array, while the reference implementation allocates tree nodes dynamically.

— Andriy Makukha

mmap-ing should be faster (~5% on my Linux laptop): #include <sys/mman.h>, <sys/stat.h>, <fcntl.h>, replace the file reading with int d=open(argv[1],0);struct stat s;fstat(d,&s);dataLength=s.st_size;data=mmap(0,dataLength,1,1,d,0); and comment out free(data);

— ngn

4

Rust

On my computer, this runs giganovel 100000 about 42% faster (10.64 s vs. 18.24 s) than Moogie's C "prefix tree + bins" solution. It also has no predefined limits (unlike the C solution, which predefines limits on word length, unique words, repeated words, etc.).

`src/main.rs`

use memmap::MmapOptions;
use pdqselect::select_by_key;
use std::cmp::Reverse;
use std::default::Default;
use std::env::args;
use std::fs::File;
use std::io::{self, Write};
use typed_arena::Arena;

#[derive(Default)]
struct Trie<'a> {
    nodes: [Option<&'a mut Trie<'a>>; 26],
    count: u64,
}

fn main() -> io::Result<()> {
    // Parse arguments
    let mut args = args();
    args.next().unwrap();
    let filename = args.next().unwrap();
    let size = args.next().unwrap().parse().unwrap();

    // Open input
    let file = File::open(filename)?;
    let mmap = unsafe { MmapOptions::new().map(&file)? };

    // Build trie
    let arena = Arena::new();
    let mut num_words = 0;
    let mut root = Trie::default();
    {
        let mut node = &mut root;
        for byte in &mmap[..] {
            let letter = (byte | 32).wrapping_sub(b'a');
            if let Some(child) = node.nodes.get_mut(letter as usize) {
                node = child.get_or_insert_with(|| {
                    num_words += 1;
                    arena.alloc(Default::default())
                });
            } else {
                node.count += 1;
                node = &mut root;
            }
        }
        node.count += 1;
    }

    // Extract all counts
    let mut index = 0;
    let mut counts = Vec::with_capacity(num_words);
    let mut stack = vec![root.nodes.iter()];
    'a: while let Some(frame) = stack.last_mut() {
        while let Some(child) = frame.next() {
            if let Some(child) = child {
                if child.count != 0 {
                    counts.push((child.count, index));
                    index += 1;
                }
                stack.push(child.nodes.iter());
                continue 'a;
            }
        }
        stack.pop();
    }

    // Find frequent counts
    select_by_key(&mut counts, size, |&(count, _)| Reverse(count));
    // Or, in nightly Rust:
    //counts.partition_at_index_by_key(size, |&(count, _)| Reverse(count));

    // Extract frequent words
    let size = size.min(counts.len());
    counts[0..size].sort_by_key(|&(_, index)| index);
    let mut out = Vec::with_capacity(size);
    let mut it = counts[0..size].iter();
    if let Some(mut next) = it.next() {
        index = 0;
        stack.push(root.nodes.iter());
        let mut word = vec![b'a' - 1];
        'b: while let Some(frame) = stack.last_mut() {
            while let Some(child) = frame.next() {
                *word.last_mut().unwrap() += 1;
                if let Some(child) = child {
                    if child.count != 0 {
                        if index == next.1 {
                            out.push((word.to_vec(), next.0));
                            if let Some(next1) = it.next() {
                                next = next1;
                            } else {
                                break 'b;
                            }
                        }
                        index += 1;
                    }
                    stack.push(child.nodes.iter());
                    word.push(b'a' - 1);
                    continue 'b;
                }
            }
            stack.pop();
            word.pop();
        }
    }
    out.sort_by_key(|&(_, count)| Reverse(count));

    // Print results
    let stdout = io::stdout();
    let mut stdout = io::BufWriter::new(stdout.lock());
    for (word, count) in out {
        stdout.write_all(&word)?;
        writeln!(stdout, " {}", count)?;
    }

    Ok(())
}

`Cargo.toml`

[package]
name = "frequent"
version = "0.1.0"
authors = ["Anders Kaseorg <andersk@mit.edu>"]
edition = "2018"

[dependencies]
memmap = "0.7.0"
typed-arena = "1.4.1"
pdqselect = "0.1.0"

[profile.release]
lto = true
opt-level = 3

Usage

cargo build --release
time target/release/frequent ulysses64 10
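The "Find frequent counts" step in the code above first partitions the counts so that the largest ones come first and only afterwards deals with ordering them. A rough Python sketch of that select-then-sort idea follows. It uses a simple randomized three-way quickselect, which is an assumption for illustration and not the algorithm the pdqselect crate implements; `partition_top` and `top_k_sorted` are hypothetical names.

```python
import random

def partition_top(items, k, key):
    """Reorder items so that the k entries with the largest keys occupy items[:k]."""
    lo, hi = 0, len(items)
    k = min(k, len(items))
    # Invariant: items[:lo] belong to the top k, items[hi:] do not.
    while lo < k < hi:
        pivot = key(items[random.randrange(lo, hi)])
        segment = items[lo:hi]
        larger = [x for x in segment if key(x) > pivot]
        equal = [x for x in segment if key(x) == pivot]
        smaller = [x for x in segment if key(x) < pivot]
        items[lo:hi] = larger + equal + smaller
        if k <= lo + len(larger):
            hi = lo + len(larger)                 # boundary lies among the larger keys
        elif k <= lo + len(larger) + len(equal):
            break                                 # boundary falls inside the ties
        else:
            lo += len(larger) + len(equal)        # boundary lies among the smaller keys
    return items

def top_k_sorted(pairs, k):
    """pairs are (count, payload); return the k largest by count, in descending order."""
    partition_top(pairs, k, key=lambda pair: pair[0])
    return sorted(pairs[:k], key=lambda pair: pair[0], reverse=True)
```

Selection takes expected linear time in the number of distinct words, and only the k selected entries are sorted, so the cost of ordering depends on k rather than on the vocabulary size.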

— Anders Kaseorg

1

Superb! Very good performance in all three settings. I was literally just in the middle of watching a recent talk by Carol Nichols about Rust :) Somewhat unusual syntax, but I'm excited to learn the language: it seems to be the only one of the post-C++ systems languages that doesn't sacrifice much performance while making the developer's life much easier.

— Andriy Makukha

Very swift! I'm impressed! I wonder whether a better compiler option for the C (tree + bins) solution would give a similar result?

— Moogie

@Moogie I was already testing yours with -O3, and -Ofast doesn't make a measurable difference.

— Anders Kaseorg

@Moogie, I was compiling your code like this: gcc -O3 -march=native -mtune=native program.c.

— Andriy Makukha

@Andriy Makukha ah. That would explain the large difference in speed between the results you are getting and my results: you were already applying optimisation flags. I don't think there are many large code optimisations left. I cannot test using mmap as suggested by others, as MinGW doesn't have an implementation... and it would only give a 5% increase. I think I will have to yield to Anders' awesome entry. Well done!

— Moogie

3

APL (Dyalog Unicode)

The following runs in under 8 seconds on my 2.6 GHz i7-4720HQ using 64-bit Dyalog APL 17.0 on Windows 10:

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞

It first prompts for the file name, then for k. Note that a significant part of the running time (about 1 second) is just reading the file in.

To time it, you should be able to pipe the following into your dyalog executable (for the ten most frequent words):

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞
/tmp/ulysses64
10
⎕OFF

It should print:

 the  968832
 of   528960
 and  466432
 a    421184
 to   322624
 in   320512
 he   270528
 his  213120
 i    191808
 s    182144

— Adám

Very nice! It beats Python. It worked best after export MAXWS=4096M. I guess it uses hash tables? Because reducing the workspace size to 2 GB makes it slower by a whole 2 seconds.

— Andriy Makukha

@AndriyMakukha Yes, ∊ uses a hash table according to this, and I'm pretty sure ⌸ does too internally.

— Adám

Why is it O(N log N)? To me it looks more like the Python solution (restoring a heap of all unique words k times) or the AWK solution (sorting only the unique words). Unless you sort all words, as in McIlroy's shell script, it shouldn't be O(N log N).

— Andriy Makukha

@AndriyMakukha It grades all the counts. Here is what our performance guy wrote to me: the time complexity is O(N log N), unless you believe some theoretically dubious things about hash tables, in which case it's O(N).

— Adám

Well, when I run your code against 8, 16, and 32 Ulysses, it slows down exactly linearly. Maybe your performance guy needs to reconsider his views on the time complexity of hash tables :) Also, this code doesn't work for the larger test case. It returns WS FULL, even though I increased the workspace to 6 GB.

— Andriy Makukha

2

[C] Prefix Tree + Bins

NOTE: The compiler used has a significant effect on the program's execution speed! I have used gcc (MinGW.org GCC-8.2.0-3) 8.2.0. When using the -Ofast switch, the program runs almost 50% faster than the normally compiled program.

Algorithm Complexity

I have since come to realise that the bin sorting I am performing is a form of pigeonhole sort, which means I can derive the Big O complexity of this solution.

I calculate it to be:

Worst Time complexity: O(1 + N + k)
Worst Space complexity: O(26*M + N + n) = O(M + N + n)

Where N is the number of words of the data
and M is the number of letters of the data
and n is the range of pigeon holes
and k is the desired number of sorted words to return
and N<=M

The complexity of the tree construction is equivalent to that of tree traversal, and at any level finding the correct node to traverse to is O(1) (as each letter maps directly to a node and we only ever traverse one level of the tree for each letter).

Pigeonhole sorting is O(N + n), where n is the range of key values. However, for this problem we do not need to sort all values, only k of them, so the worst case would be O(N + k).

Combining these yields O(1 + N + k).

The space complexity of the tree construction follows from the worst case of 26*M nodes, which occurs if the data consists of one word with M letters and each node has 26 child nodes (i.e. one per letter of the alphabet). Thus O(26*M) = O(M).

The pigeonhole sorting has a space complexity of O(N + n).

Combining these yields O(26*M + N + n) = O(M + N + n).

Algorithm

It takes two arguments as input (the path to the text file and k, the number of most frequent words to list).

Compared with my other entries, this version has a very small time cost ramp with increasing values of k. It is noticeably slower for low values of k, but it should be much faster for larger values of k.

It creates a tree branching on the letters of words, then increments a counter at the leaf letters. Then it adds the word to a bin of words with the same count (after first removing the word from the bin it already resided in). This all repeats until no more letters are read in. After that, the bins are iterated in reverse, starting from the largest bin, until k words have been output.

It currently defaults to outputting the processing time, but for consistency with other submissions, disable the TIMING definition in the source code.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// may need to increase if the source text has many repeated words
#define MAX_BINS 1000000

// assume maximum of 20 letters in a word... adjust accordingly
#define MAX_LETTERS_IN_A_WORD 20

// assume maximum of 10 letters for the string representation of the bin number... adjust accordingly
#define MAX_LETTERS_FOR_BIN_NAME 10

// maximum number of bytes of the output results
#define MAX_OUTPUT_SIZE 10000000

#define false 0
#define true 1
#define null 0
#define SPACE_ASCII_CODE 32

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    //char isAWord;
    struct Letter* parent;
    struct Letter* binElementNext;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

struct Bin
{
  struct Letter* word;
};
typedef struct Bin Bin;


int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
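    fptr = fopen(argv[1], "rb");
    if (fptr == null)
    {
        // bail out early instead of passing a null FILE pointer to fseek/fread below
        printf("Unable to open input file: %s\n", argv[1]);
        return -1;
    }
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    fclose(fptr);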

    unsigned int chr;
    unsigned int i, j;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], null, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the memory for bins
    Bin* bins = (Bin*) malloc(sizeof(Bin) * MAX_BINS);
    memset(&bins[0], null, sizeof( Bin) * MAX_BINS);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;
    Letter *nextFreeLetter = &letters[0];

    // pesudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;

    unsigned int sortedListSize = 0;

    // the count of the most frequent word
    unsigned int maxCount = 0;

    // the count of the current word
    unsigned int wordCount = 0;

////////////////////////////////////////////////////////////////////////////////////////////
// CREATING PREFIX TREE
    j=dataLength;
    while (--j>0)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, intialise the new letter
            if (nextLetter == null)
            {
                ++letterMasterIndex;
                nextLetter = ++nextFreeLetter;
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        else
        {
            //currentLetter->isAWord = true;

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

////////////////////////////////////////////////////////////////////////////////////////////
// ADDING TO BINS

    j = letterMasterIndex;
    currentLetter=&letters[j-1];
    while (--j>0)
    {

      // is the letter the leaf letter of word?
      if (currentLetter->count>0)
      {
        i = currentLetter->count;
        if (maxCount < i) maxCount = i;

        // add to bin
        currentLetter->binElementNext = bins[i].word;
        bins[i].word = currentLetter;
      }
      --currentLetter;
    }

////////////////////////////////////////////////////////////////////////////////////////////
// PRINTING OUTPUT

    // the memory for output
    char* output = (char*) malloc(sizeof(char) * MAX_OUTPUT_SIZE);
    memset(&output[0], SPACE_ASCII_CODE, sizeof( char) * MAX_OUTPUT_SIZE);
    unsigned int outputIndex = 0;

    // string representation of the current bin number
    char binName[MAX_LETTERS_FOR_BIN_NAME];
    memset(&binName[0], SPACE_ASCII_CODE, MAX_LETTERS_FOR_BIN_NAME);


    Letter* letter;
    Letter* binElement;

    // starting at the bin representing the most frequent word(s) and then iterating backwards...
    for ( i=maxCount;i>0 && k>0;i--)
    {
      // check to ensure that the bin has at least one word
      if ((binElement = bins[i].word) != null)
      {
        // update the bin name
        sprintf(binName,"%u",i);

        // iterate of the words in the bin
        while (binElement !=null && k>0)
        {
          // stop if we have reached the desired number of outputed words
          if (k-- > 0)
          {
              letter = binElement;

              // add the bin name to the output
              memcpy(&output[outputIndex],&binName[0],MAX_LETTERS_FOR_BIN_NAME);
              outputIndex+=MAX_LETTERS_FOR_BIN_NAME;

              // construct string of letters to form the word
               while (letter != root)
              {
                // output the letter to the output
                output[outputIndex++] = letter->asciiCode+97;
                letter=letter->parent;
              }

              output[outputIndex++] = '\n';

              // go to the next word in the bin
              binElement = binElement->binElementNext;
          }
        }
      }
    }

    // write the output to std out
    fwrite(output, 1, outputIndex, stdout);
   // fflush(stdout);

   // free( data );
   // free( letters );
   // free( bins );
   // free( output );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}
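The binning step of this answer can be illustrated in isolation. The sketch below is a simplification under stated assumptions: it counts words with a `Counter` instead of the prefix tree to stay short, and `top_k_by_bins` is a name invented for this example. What it shows is the pigeonhole idea: one bin per count value, walked from the highest count downwards until k words have been emitted.

```python
import re
from collections import Counter

def top_k_by_bins(data: bytes, k: int):
    """Pigeonhole selection: bins[c] holds every word that occurs exactly c times."""
    counts = Counter(m.group().lower() for m in re.finditer(rb"[A-Za-z]+", data))
    if not counts or k <= 0:
        return []
    max_count = max(counts.values())
    bins = [[] for _ in range(max_count + 1)]
    for word, c in counts.items():
        bins[c].append(word)
    result = []
    # Walk the bins from the most frequent count down; stop after k words.
    for c in range(max_count, 0, -1):
        for word in bins[c]:
            if len(result) == k:
                return result
            result.append((word.decode("ascii"), c))
    return result
```

No comparison sort is involved: filling the bins is linear in the number of distinct words, and the final walk visits at most max_count bins plus k words. The price is one bin slot for every possible count value, which is what the MAX_BINS constant bounds in the C code.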

EDIT: now deferring the population of bins until after the tree has been constructed, and optimising the construction of the output.

EDIT2: now using pointer arithmetic instead of array access for speed optimisation.

— Moogie

Wow! The 100,000 most frequent words from a 1 GB file in 11 seconds... This looks like some kind of magic trick.

— Andriy Makukha

No tricks... Just trading CPU time for efficient memory usage. I'm surprised at your result... On my older PC it takes over 60 seconds. I noticed that I am doing unnecessary comparisons and can defer binning until the file has been processed. That should make it even faster. I will try it soon and update my answer.

— Moogie

@AndriyMakukha I have now deferred populating the bins until after all the words have been processed and the tree has been constructed. This avoids unnecessary comparisons and bin manipulation. I have also changed the way the output is constructed, as I found that the printing was taking a substantial amount of time!

— Moogie

On my machine this update doesn't make any noticeable difference. However, it performed very fast on ulysses64 once, so it is the current leader now.

— Andriy Makukha

Must be a unique issue with my PC then :) I noticed a 5 second speedup when using this new output algorithm.

— Moogie

2

J

9!:37 ] 0 _ _ _

'input k' =: _2 {. ARGV
k =: ". k

lower =: a. {~ 97 + i. 26
words =: ((lower , ' ') {~ lower i. ]) (32&OR)&.(a.&i.) fread input
words =: ' ' , words
words =: -.&(s: a:) s: words
uniq =: ~. words
res =: (k <. # uniq) {. \:~ (# , {.)/.~ uniq&i. words
echo@(,&": ' ' , [: }.@": {&uniq)/"1 res

exit 0

Run as a script with jconsole <script> <input> <k>. For example, the output for giganovel with k=100K:

$ time jconsole solve.ijs giganovel 100000 | head 
11309 e
11290 ihit
11285 ah
11260 ist
11255 aa
11202 aiv
11201 al
11188 an
11187 o
11186 ansa

real    0m13.765s
user    0m11.872s
sys     0m1.786s

There is no limit other than the amount of available system memory.

— miles

Very fast for the smaller test case! Nice! However, for arbitrarily large words it truncates the words in the output. I'm not sure whether there is a limit on the number of characters in a word, or whether it is just to make the output more concise.

— Andriy Makukha

@AndriyMakukha Yes, the ... comes from output truncation per line. I added one line at the start to disable all truncation. It slows down on giganovel, since it uses much more memory because there are more unique words.

— miles

Great! Now it passes the generality test. And it did not slow down on my machine. In fact, there was a minor speedup.

— Andriy Makukha

2

C++ (à la Knuth)

I was curious how Knuth's program would fare, so I translated his (originally Pascal) program into C++.

Even though Knuth's primary goal was not speed but to illustrate his WEB system of literate programming, the program is surprisingly competitive, and leads to a faster solution than any of the answers here so far. Here is my translation of his program (the corresponding "section" numbers of the WEB program are mentioned in comments like "{§24}"):

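#include <iostream>
#include <cassert>
#include <cstdint>  // int8_t, int32_t
#include <cstdio>   // fopen, fread, fseek, ftell
#include <cstdlib>  // malloc, atoi, exit
#include <string>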

// Adjust these parameters based on input size.
const int TRIE_SIZE = 800 * 1000; // Size of the hash table used for the trie.
const int ALPHA = 494441;  // An integer that's approximately (0.61803 * TRIE_SIZE), and relatively prime to T = TRIE_SIZE - 52.
const int kTolerance = TRIE_SIZE / 100;  // How many places to try, to find a new place for a "family" (=bunch of children).

typedef int32_t Pointer;  // [0..TRIE_SIZE), an index into the array of Nodes
typedef int8_t Char;  // We only care about 1..26 (plus two values), but there's no "int5_t".
typedef int32_t Count;  // The number of times a word has been encountered.
// These are 4 separate arrays in Knuth's implementation.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Pointer sibling;  // Previous sibling, cyclically. (From smallest child to header, and header to largest child.)
  Count count;  // The number of times this word has been encountered.
  Char ch;  // EMPTY, or 1..26, or HEADER. (For nodes with ch=EMPTY, the link/sibling/count fields mean nothing.)
} node[TRIE_SIZE + 1];
// Special values for `ch`: EMPTY (free, can insert child there) and HEADER (start of family).
const Char EMPTY = 0, HEADER = 27;

const Pointer T = TRIE_SIZE - 52;
Pointer x;  // The `n`th time we need a node, we'll start trying at x_n = (alpha * n) mod T. This holds current `x_n`.
// A header can only be in T (=TRIE_SIZE-52) positions namely [27..TRIE_SIZE-26].
// This transforms a "h" from range [0..T) to the above range namely [27..T+27).
Pointer rerange(Pointer n) {
  n = (n % T) + 27;
  // assert(27 <= n && n <= TRIE_SIZE - 26);
  return n;
}

// Convert trie node to string, by walking up the trie.
std::string word_for(Pointer p) {
  std::string word;
  while (p != 0) {
    Char c = node[p].ch;  // assert(1 <= c && c <= 26);
    word = static_cast<char>('a' - 1 + c) + word;
    // assert(node[p - c].ch == HEADER);
    p = (p - c) ? node[p - c].link : 0;
  }
  return word;
}

// Increment `x`, and declare `h` (the first position to try) and `last_h` (the last position to try). {§24}
#define PREPARE_X_H_LAST_H x = (x + ALPHA) % T; Pointer h = rerange(x); Pointer last_h = rerange(x + kTolerance);
// Increment `h`, being careful to account for `last_h` and wraparound. {§25}
#define INCR_H { if (h == last_h) { std::cerr << "Hit tolerance limit unfortunately" << std::endl; exit(1); } h = (h == TRIE_SIZE - 26) ? 27 : h + 1; }

// `p` has no children. Create `p`s family of children, with only child `c`. {§27}
Pointer create_child(Pointer p, int8_t c) {
  // Find `h` such that there's room for both header and child c.
  PREPARE_X_H_LAST_H;
  while (!(node[h].ch == EMPTY and node[h + c].ch == EMPTY)) INCR_H;
  // Now create the family, with header at h and child at h + c.
  node[h]     = {.link = p, .sibling = h + c, .count = 0, .ch = HEADER};
  node[h + c] = {.link = 0, .sibling = h,     .count = 0, .ch = c};
  node[p].link = h;
  return h + c;
}

// Move `p`'s family of children to a place where child `c` will also fit. {§29}
void move_family_for(const Pointer p, Char c) {
  // Part 1: Find such a place: need room for `c` and also all existing children. {§31}
  PREPARE_X_H_LAST_H;
  while (true) {
    INCR_H;
    if (node[h + c].ch != EMPTY) continue;
    Pointer r = node[p].link;
    int delta = h - r;  // We'd like to move each child by `delta`
    while (node[r + delta].ch == EMPTY and node[r].sibling != node[p].link) {
      r = node[r].sibling;
    }
    if (node[r + delta].ch == EMPTY) break;  // There's now space for everyone.
  }

  // Part 2: Now actually move the whole family to start at the new `h`.
  Pointer r = node[p].link;
  int delta = h - r;
  do {
    Pointer sibling = node[r].sibling;
    // Move node from current position (r) to new position (r + delta), and free up old position (r).
    node[r + delta] = {.ch = node[r].ch, .count = node[r].count, .link = node[r].link, .sibling = node[r].sibling + delta};
    if (node[r].link != 0) node[node[r].link].link = r + delta;
    node[r].ch = EMPTY;
    r = sibling;
  } while (node[r].ch != EMPTY);
}

// Advance `p` to its `c`th child. If necessary, add the child, or even move `p`'s family. {§21}
Pointer find_child(Pointer p, Char c) {
  // assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // If `p` currently has *no* children.
  Pointer q = node[p].link + c;
  if (node[q].ch == c) return q;  // Easiest case: `p` already has a `c`th child.
  // Make sure we have room to insert a `c`th child for `p`, by moving its family if necessary.
  if (node[q].ch != EMPTY) {
    move_family_for(p, c);
    q = node[p].link + c;
  }
  // Insert child `c` into `p`'s family of children (at `q`), with correct siblings. {§28}
  Pointer h = node[p].link;
  while (node[h].sibling > q) h = node[h].sibling;
  node[q] = {.ch = c, .count = 0, .link = 0, .sibling = node[h].sibling};
  node[h].sibling = q;
  return q;
}

// Largest descendant. {§18}
Pointer last_suffix(Pointer p) {
  while (node[p].link != 0) p = node[node[p].link].sibling;
  return p;
}

// The largest count beyond which we'll put all words in the same (last) bucket.
// We do an insertion sort (potentially slow) in last bucket, so increase this if the program takes a long time to walk trie.
const int MAX_BUCKET = 10000;
Pointer sorted[MAX_BUCKET + 1];  // The head of each list.

// Records the count `n` of `p`, by inserting `p` in the list that starts at `sorted[n]`.
// Overwrites the value of node[p].sibling (uses the field to mean its successor in the `sorted` list).
void record_count(Pointer p) {
  // assert(node[p].ch != HEADER);
  // assert(node[p].ch != EMPTY);
  Count f = node[p].count;
  if (f == 0) return;
  if (f < MAX_BUCKET) {
    // Insert at head of list.
    node[p].sibling = sorted[f];
    sorted[f] = p;
  } else {
    Pointer r = sorted[MAX_BUCKET];
    if (node[p].count >= node[r].count) {
      // Insert at head of list
      node[p].sibling = r;
      sorted[MAX_BUCKET] = p;
    } else {
      // Find right place by count. This step can be SLOW if there are too many words with count >= MAX_BUCKET
      while (node[p].count < node[node[r].sibling].count) r = node[r].sibling;
      node[p].sibling = node[r].sibling;
      node[r].sibling = p;
    }
  }
}

// Walk the trie, going over all words in reverse-alphabetical order. {§37}
// Calls "record_count" for each word found.
void walk_trie() {
  // assert(node[0].ch == HEADER);
  Pointer p = node[0].sibling;
  while (p != 0) {
    Pointer q = node[p].sibling;  // Saving this, as `record_count(p)` will overwrite it.
    record_count(p);
    // Move down to last descendant of `q` if any, else up to parent of `q`.
    p = (node[q].ch == HEADER) ? node[q].link : last_suffix(q);
  }
}

int main(int, char** argv) {
  // Program startup
  std::ios::sync_with_stdio(false);

  // Set initial values {§19}
  for (Char i = 1; i <= 26; ++i) node[i] = {.ch = i, .count = 0, .link = 0, .sibling = i - 1};
  node[0] = {.ch = HEADER, .count = 0, .link = 0, .sibling = 26};

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0L, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  if (fptr) fclose(fptr);

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (int i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  ++node[p].count;  // Count the last word if the input ends in the middle of a word.
  node[0].count = 0;

  walk_trie();

  const int max_words_to_print = atoi(argv[2]);
  int num_printed = 0;
  for (Count f = MAX_BUCKET; f >= 0 && num_printed <= max_words_to_print; --f) {
    for (Pointer p = sorted[f]; p != 0 && num_printed < max_words_to_print; p = node[p].sibling) {
      std::cout << word_for(p) << " " << node[p].count << std::endl;
      ++num_printed;
    }
  }

  return 0;
}

Differences from Knuth's program:

I combined Knuth's 4 arrays link, sibling, count and ch into an array of a struct Node (I find it easier to understand this way).
I changed the literate-programming (WEB-style) textual transclusion of sections into more conventional function calls (and a couple of macros).
We don't need to use standard Pascal's weird I/O conventions and restrictions, so this uses fread and (data[i] | 32) - 'a' as in the other answers here, instead of the Pascal workaround.
In case we exceed limits (run out of space) while the program is running, Knuth's original program deals with it gracefully by dropping later words and printing a message at the end. (It is not quite right to say that McIlroy "criticized Knuth's solution as not even able to process a full text of the Bible"; he was only pointing out that frequent words may sometimes occur very late in a text, such as the word "Jesus" in the Bible, so the error condition is not innocuous.) I have taken the noisier (and anyway easier) approach of simply terminating the program.
The program declares a constant TRIE_SIZE to control the memory usage, which I bumped up. (The constant of 32767 had been chosen for the original requirements, "a user should be able to find the 100 most frequent words in a twenty-page technical paper (roughly a 50K byte file)", and because Pascal deals well with ranged integer types. We had to increase it 25x to 800,000, as the test input is now 20 million times larger.)
For the final printing of the strings, we can just walk the trie and do a dumb (possibly even quadratic) string append.

Apart from that, this is pretty much exactly Knuth's program (using his hash trie / packed trie data structure and bucket sort), and it does pretty much the same operations as Knuth's Pascal program would while looping through all the characters in the input. Note that it uses no external algorithm or data structure libraries, and that words of equal frequency are printed in alphabetical order.
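The bucket-sort step can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of the program above: the name top_k_by_bucket, the dict input and the max_bucket parameter are made up for the example, and plain lists stand in for the linked lists that the program threads through the sibling field.

```python
def top_k_by_bucket(counts, k, max_bucket=10000):
    """Return the k most frequent (word, count) pairs using bucket sort.

    counts: dict mapping word -> positive count (hypothetical input format).
    Counts below max_bucket get one bucket each; larger counts share the
    last bucket, which is kept in descending order by insertion sort.
    """
    buckets = [[] for _ in range(max_bucket + 1)]
    for word in sorted(counts):                  # alphabetical order
        f = counts[word]
        if f < max_bucket:
            buckets[f].append((word, f))         # bucket stays alphabetical
        else:
            overflow = buckets[max_bucket]
            i = len(overflow)
            # Insertion sort: move left past entries with a smaller count.
            while i > 0 and overflow[i - 1][1] < f:
                i -= 1
            overflow.insert(i, (word, f))
    result = []
    for f in range(max_bucket, 0, -1):           # highest counts first
        for pair in buckets[f]:
            if len(result) == k:
                return result
            result.append(pair)
    return result
```

Appending in alphabetical order has the same effect as walking the trie in reverse alphabetical order and inserting at the head of each list. The insertion sort in the shared last bucket is the part that can become slow when many words have counts of max_bucket or more.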

Timing

Compiled with

clang++ -std=c++17 -O2 ptrie-walktrie.cc

When run on the largest test case here (giganovel with 100,000 words requested), and compared against the fastest program posted here so far, I find it slightly but consistently faster:

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]

(The top line is Anders Kaseorg's Rust solution; the bottom is the program above. These are timings from 100 runs, with mean, min, max, median, and quartiles.)

Analysis

Why is this faster? It is not that C++ is faster than Rust, or that Knuth's program is the fastest possible. In fact, Knuth's program is slower on insertions because of the trie packing (done to conserve memory). The reason, I suspect, is related to something that Knuth complained about in 2008:

A Flame About 64-bit Pointers

It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.

The program above uses 32-bit array indices (not 64-bit pointers), so the "Node" struct occupies less memory, so more Nodes fit in the cache and there are fewer cache misses. (In fact, there was some work on this as the x32 ABI, but it seems not to be in a good state even though the idea is obviously useful; see for example the recent announcement of pointer compression in V8. Oh well.) So on giganovel, this program uses 12.8 MB for the (packed) trie, versus the Rust program's 32.18MB for its trie (also on giganovel). We could scale up 1000x (from "giganovel" to "teranovel", say) and still not exceed 32-bit indices, so this seems a reasonable choice.
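The size difference can be checked with a small Python sketch that uses ctypes to lay out the structs the way a C compiler would. IndexNode mirrors the fields of Node in the program above; PointerNode is a hypothetical variant, not taken from any answer here, in which link and sibling are native pointers.

```python
import ctypes


class IndexNode(ctypes.Structure):
    # Same fields as `Node` above: three 32-bit integers (two array
    # indices and a count) plus one 8-bit character code.
    _fields_ = [
        ("link", ctypes.c_int32),
        ("sibling", ctypes.c_int32),
        ("count", ctypes.c_int32),
        ("ch", ctypes.c_int8),
    ]


class PointerNode(ctypes.Structure):
    # Hypothetical variant: native pointers instead of 32-bit indices.
    _fields_ = [
        ("link", ctypes.c_void_p),
        ("sibling", ctypes.c_void_p),
        ("count", ctypes.c_int32),
        ("ch", ctypes.c_int8),
    ]


TRIE_SIZE = 800 * 1000  # same constant as in the program above

index_bytes = ctypes.sizeof(IndexNode) * (TRIE_SIZE + 1)
pointer_bytes = ctypes.sizeof(PointerNode) * (TRIE_SIZE + 1)

print("IndexNode:  ", ctypes.sizeof(IndexNode), "bytes per node,", index_bytes, "bytes total")
print("PointerNode:", ctypes.sizeof(PointerNode), "bytes per node,", pointer_bytes, "bytes total")
```

With padding, the index-based node takes 16 bytes, and 16 bytes times 800,001 nodes is about 12.8 MB, which matches the figure quoted above. On a typical 64-bit build the pointer-based variant should come out larger (24 bytes per node would be the usual layout), though the exact number depends on the platform.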

Faster variant

We can optimize for speed and forgo the packing, so that we use the (non-packed) trie as in the Rust solution, but with indices instead of pointers. This gives something that is faster and has no pre-fixed limits on the number of distinct words, characters, etc.:

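#include <iostream>
#include <cassert>
#include <vector>
#include <algorithm>
#include <cstdint>  // int8_t, int32_t
#include <cstdio>   // fopen, fread, fseek, ftell
#include <cstdlib>  // malloc, atoi
#include <ctime>    // std::clock
#include <string>
#include <utility>  // std::pair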

typedef int32_t Pointer;  // [0..node.size()), an index into the array of Nodes
typedef int32_t Count;
typedef int8_t Char;  // We'll usually just have 1 to 26.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Count count;  // The number of times this word has been encountered. Undefined for header nodes.
};
std::vector<Node> node; // Our "arena" for Node allocation.

std::string word_for(Pointer p) {
  std::vector<char> drow;  // The word backwards
  while (p != 0) {
    Char c = p % 27;
    drow.push_back('a' - 1 + c);
    p = (p - c) ? node[p - c].link : 0;
  }
  return std::string(drow.rbegin(), drow.rend());
}

// `p` has no children. Create `p`s family of children, with only child `c`.
Pointer create_child(Pointer p, Char c) {
  Pointer h = node.size();
  node.resize(node.size() + 27);
  node[h] = {.link = p, .count = -1};
  node[p].link = h;
  return h + c;
}

// Advance `p` to its `c`th child. If necessary, add the child.
Pointer find_child(Pointer p, Char c) {
  assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // Case 1: `p` currently has *no* children.
  return node[p].link + c;  // Case 2 (easiest case): Already have the child c.
}

int main(int, char** argv) {
  auto start_c = std::clock();

  // Program startup
  std::ios::sync_with_stdio(false);

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  fclose(fptr);

  node.reserve(dataLength / 600);  // Heuristic based on test data. OK to be wrong.
  node.push_back({0, 0});
  for (Char i = 1; i <= 26; ++i) node.push_back({0, 0});

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (long i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  ++node[p].count;
  node[0].count = 0;

  // Brute-force: Accumulate all words and their counts, then sort by frequency and print.
  std::vector<std::pair<int, std::string>> counts_words;
  for (Pointer i = 1; i < static_cast<Pointer>(node.size()); ++i) {
    int count = node[i].count;
    if (count == 0 || i % 27 == 0) continue;
    counts_words.push_back({count, word_for(i)});
  }
  auto cmp = [](auto x, auto y) {
    if (x.first != y.first) return x.first > y.first;
    return x.second < y.second;
  };
  std::sort(counts_words.begin(), counts_words.end(), cmp);
  const int max_words_to_print = std::min<int>(counts_words.size(), atoi(argv[2]));
  for (int i = 0; i < max_words_to_print; ++i) {
    auto [count, word] = counts_words[i];
    std::cout << word << " " << count << std::endl;
  }

  return 0;
}

This program, despite doing something a lot dumber for sorting than the solutions here, uses only 12.2MB for its trie (on giganovel), and manages to be faster. Timings of this program (last line), compared with the earlier timings mentioned:

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]
itrie-nolimit:             3.907 ±   0.127 [ 3.69.. 4.23]        [... 3.81 ...   3.9 ...   4.0...]

I would be eager to see how this (or the hash-trie program) would fare if translated into Rust. :-)
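For readers who want to see the layout of the unpacked trie without the C++ details, here is a minimal Python sketch of the same idea. It is an illustration, not a port: the names count_words and top_k are made up for the example, two parallel lists stand in for the vector of Node structs, and header slots keep a count of 0 rather than -1.

```python
def count_words(text):
    """Count words in `text` (a bytes object) with an index-based trie.

    link[p]  : index of the header of p's family of 27 slots, or 0 if none.
    count[p] : number of times the word ending at node p was seen.
    Slot h is a family header (it links back to the parent) and slots
    h+1 .. h+26 are the children for 'a' .. 'z', so headers sit at
    multiples of 27 and a node's letter is simply p % 27.
    """
    link = [0] * 27
    count = [0] * 27
    p = 0
    for byte in text + b" ":             # trailing space flushes the last word
        c = (byte | 32) - 96             # 1..26 for letters of either case
        if 1 <= c <= 26:
            if p == 0:
                p = c                    # first letter: root family is 0..26
            else:
                if link[p] == 0:         # no children yet: append a family
                    h = len(link)
                    link.extend([0] * 27)
                    count.extend([0] * 27)
                    link[h] = p          # header points back to the parent
                    link[p] = h
                p = link[p] + c
        else:
            count[p] += 1
            p = 0
    count[0] = 0                         # node 0 only collected separators
    return link, count


def word_for(link, p):
    """Rebuild the word ending at node p by walking up to the root."""
    letters = []
    while p != 0:
        c = p % 27
        letters.append(chr(96 + c))
        header = p - c
        p = link[header] if header else 0
    return "".join(reversed(letters))


def top_k(text, k):
    """Return the k most frequent words, ties broken alphabetically."""
    link, count = count_words(text)
    words = [(count[p], word_for(link, p))
             for p in range(1, len(count))
             if p % 27 != 0 and count[p] > 0]
    words.sort(key=lambda cw: (-cw[0], cw[1]))
    return [(w, c) for c, w in words[:k]]
```

For example, top_k(b"The cat and the hat. THE END", 2) counts "the" three times regardless of case and then picks "and" as the alphabetically first of the words seen once.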

Further details

About the data structure used here: an explanation of "packing" tries is given tersely in Exercise 4 of Section 6.3 (Digital Searching, i.e. tries) in Volume 3 of TAOCP, and also in the thesis of Knuth's student Frank Liang about hyphenation in TeX: Word Hy-phen-a-tion by Com-put-er.
The context of Bentley's columns, Knuth's program, and McIlroy's review (only a small part of which was about the Unix philosophy) is clearer in light of the previous and later columns, and of Knuth's previous experience, including compilers, TAOCP, and TeX.
There is an entire book, Exercises in Programming Style, showing different approaches to this particular program.

I have an unfinished blog post elaborating on the points above; I might edit this answer when it is done. Meanwhile, I am posting this answer here anyway, on the occasion of Knuth's birthday (January 10). :-)

— ShreevatsaR

Awesome! Not only did someone finally post Knuth's solution (I intended to do it, but in Pascal) with a great analysis and performance that beats some of the best previous postings, but they also set a new speed record with another C++ program! Wonderful.

— Andriy Makukha

The only two comments I have: 1) your second program currently fails with Segmentation fault: 11 for test cases with arbitrarily large words and gaps; 2) even though it might feel like I sympathize with McIlroy's "critique", I am well aware that Knuth's intention was only to show off his literate programming technique, while McIlroy criticized it from the engineering perspective. McIlroy himself later admitted that it was not a fair thing to do.

— Andriy Makukha

@AndriyMakukha Oops! That was the recursive word_for; fixed it now. Yes, McIlroy, as the inventor of Unix pipes, took the opportunity to evangelize the Unix philosophy of composing small tools. It is a good philosophy compared to Knuth's frustrating (if you are trying to read his programs) monolithic approach, but in context it was a bit unfair, also for another reason: today the Unix way is widely available, but in 1986 it was confined to Bell Labs, Berkeley, etc. ("his firm makes the best prefabs in the business").

— ShreevatsaR

Works! Congrats to the new king :-P As for Unix and Knuth, he did not seem to like the system very much, because there was and is little unity between different tools (e.g. many tools define regexes differently).

— Andriy Makukha

1

Python 3

This implementation with a simple dictionary is slightly faster than the one using Counter on my system.

def words_from_file(filename):
    import re

    pattern = re.compile('[a-z]+')

    for line in open(filename):
        yield from pattern.findall(line.lower())


def freq(textfile, k):
    frequencies = {}

    for word in words_from_file(textfile):
        frequencies[word] = frequencies.get(word, 0) + 1

    most_frequent = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

    for i, (word, frequency) in enumerate(most_frequent):
        if i == k:
            break

        yield word, frequency


from time import time

start = time()
print('\n'.join('{}:\t{}'.format(f, w) for w,f in freq('giganovel', 10)))
end = time()
print(end - start)

— movatica

1

I could only test with giganovel on my system, and it takes quite long (~90 seconds). Project Gutenberg is blocked in Germany for legal reasons...

— movatica

Interesting. Either heapq does not add any performance to the Counter.most_common method, or enumerate(sorted(...)) also uses heapq internally.

— Andriy Makukha

I tested with Python 2 and the performance was similar, so I guess sorting works about as fast as Counter.most_common.

— Andriy Makukha

Yeah, maybe it was just jitter on my system... At least it is not slower :) But the regex search is much faster than iterating over characters. It seems to be implemented quite performantly.

— movatica

1

[C] Prefix Tree + Sorted Linked List

It takes two arguments as input (the path to the text file, and k, the number of most frequent words to list).

Based on my other entry, this version is much faster for larger values of k, but at a minor performance cost at lower values of k.

It creates a tree branching on the letters of words, then increments a counter at the leaf letters. It then checks whether the current leaf counter is greater than the count of the least frequent word in the list of most frequent words (the list size is the number given on the command line). If so, the word represented by the leaf letter is promoted to be one of the most frequent. If it is already a most frequent word, it is swapped with the next most frequent word if its count is now higher, thus keeping the list sorted. This all repeats until no more letters are read in, after which the list of most frequent words is output.
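The approach described above can be sketched in Python as follows. This is a hedged illustration rather than a port of the C code below: the names Letter, top_k_words and spell are made up for the example, a Python list with stored rank positions stands in for the doubly linked list, and the sketch bubbles a word upwards after every increment, including right after a promotion.

```python
class Letter:
    """One trie node: a letter reached from its parent."""
    __slots__ = ("parent", "code", "count", "children", "rank")

    def __init__(self, parent, code):
        self.parent = parent
        self.code = code              # 0..25 for 'a'..'z', -1 for the root
        self.count = 0
        self.children = [None] * 26
        self.rank = -1                # position in the top list, -1 if absent


def spell(node):
    """Walk parent links back to the root to recover the word."""
    letters = []
    while node.parent is not None:
        letters.append(chr(node.code + 97))
        node = node.parent
    return "".join(reversed(letters))


def top_k_words(data, k):
    """Return the k most frequent words in `data` (bytes) with counts."""
    if k <= 0:
        return []
    root = Letter(None, -1)
    top = []                          # kept sorted by descending count
    node = root
    for byte in data + b" ":          # trailing space flushes the last word
        i = (byte | 32) - 97          # fold case, then make 'a' zero
        if 0 <= i < 26:
            if node.children[i] is None:
                node.children[i] = Letter(node, i)
            node = node.children[i]
        elif node is not root:
            node.count += 1
            if node.rank < 0:
                if len(top) < k:
                    node.rank = len(top)
                    top.append(node)
                elif node.count > top[-1].count:
                    top[-1].rank = -1          # evict the current minimum
                    node.rank = k - 1
                    top[-1] = node
            # Bubble the word up while it outranks its neighbour.
            r = node.rank
            while r > 0 and top[r].count > top[r - 1].count:
                top[r], top[r - 1] = top[r - 1], top[r]
                top[r].rank, top[r - 1].rank = r, r - 1
                r -= 1
            node = root
    return [(spell(n), n.count) for n in top]
```

Because counts only ever grow by one, a word outside the list can exceed the current minimum by at most one, so replacing the last entry and then bubbling is enough to keep the list both correct and sorted. Words with equal counts come out in whatever order they reached that count, not alphabetically.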

It currently outputs the processing time by default; for consistency with other submissions, disable the TIMING definition in the source code.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char isTopWord;
    struct Letter* parent;
    struct Letter* higher;
    struct Letter* lower;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pseudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;
    Letter* sortedWordsStart = null;
    Letter* sortedWordsEnd = null;
    Letter* A;
    Letter* B;
    Letter* C;
    Letter* D;

    unsigned int sortedListSize = 0;


    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, initialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // is this word not in the top word list?
            if (!currentLetter->isTopWord)
            {
                // first word becomes the sorted list
                if (sortedWordsStart == null)
                {
                  sortedWordsStart = currentLetter;
                  sortedWordsEnd = currentLetter;
                  currentLetter->isTopWord = true;
                  ++sortedListSize;
                }
                // always add words until list is at desired size, or 
                // swap the current word with the end of the sorted word list if current word count is larger
                else if (sortedListSize < k || currentLetter->count> sortedWordsEnd->count)
                {
                    // replace sortedWordsEnd entry with current word
                    if (sortedListSize == k)
                    {
                      currentLetter->higher = sortedWordsEnd->higher;
                      currentLetter->higher->lower = currentLetter;
                      sortedWordsEnd->isTopWord = false;
                    }
                    // add current word to the sorted list as the sortedWordsEnd entry
                    else
                    {
                      ++sortedListSize;
                      sortedWordsEnd->lower = currentLetter;
                      currentLetter->higher = sortedWordsEnd;
                    }

                    currentLetter->lower = null;
                    sortedWordsEnd = currentLetter;
                    currentLetter->isTopWord = true;
                }
            }
            // word is in top list
            else
            {
                // check to see whether the current word count is greater than the supposedly next highest word in the list
                // we ignore the word that is sortedWordsStart (i.e. most frequent)
                while (currentLetter != sortedWordsStart && currentLetter->count> currentLetter->higher->count)
                {
                    B = currentLetter->higher;
                    C = currentLetter;
                    A = B != null ? currentLetter->higher->higher : null;
                    D = currentLetter->lower;

                    if (A !=null) A->lower = C;
                    if (D !=null) D->higher = B;
                    B->higher = C;
                    C->higher = A;
                    B->lower = D;
                    C->lower = B;

                    if (B == sortedWordsStart)
                    {
                      sortedWordsStart = C;
                    }

                    if (C == sortedWordsEnd)
                    {
                      sortedWordsEnd = B;
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    Letter* letter;
    while (sortedWordsStart != null )
    {
        letter = sortedWordsStart;
        highestWordCount = letter->count;
        string[0]=0;
        tmp[0]=0;

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
        sortedWordsStart = sortedWordsStart->lower;
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

— Moogie

The output for k = 100,000 is not properly sorted: 12 eroilk 111 iennoa 10 yttelen 110 engyt.

— Andriy Makukha

I think I have an idea as to the cause. My thought is that I will need to iterate, swapping words in the list, when checking whether the current word outranks the next highest word. I will check when I have time.

— Moogie

Hmm, well, it seems the simple fix of changing an if to a while works, but it also significantly slows down the algorithm for larger values of k. I may have to think of a cleverer solution.

— Moogie

1

C#

This one should work with the latest .net SDKs.

using System;
using System.IO;
using System.Diagnostics;
using System.Collections.Generic;
using System.Linq;
using static System.Console;

class Node {
    public Node Parent;
    public Node[] Nodes;
    public int Index;
    public int Count;

    public static readonly List<Node> AllNodes = new List<Node>();

    public Node(Node parent, int index) {
        this.Parent = parent;
        this.Index = index;
        AllNodes.Add(this);
    }

    public Node Traverse(uint u) {
        int b = (int)u;
        if (this.Nodes is null) {
            this.Nodes = new Node[26];
            return this.Nodes[b] = new Node(this, b);
        }
        if (this.Nodes[b] is null) return this.Nodes[b] = new Node(this, b);
        return this.Nodes[b];
    }

    public string GetWord() => this.Index >= 0 
        ? this.Parent.GetWord() + (char)(this.Index + 97)
        : "";
}

class Freq {
    const int DefaultBufferSize = 0x10000;

    public static void Main(string[] args) {
        var sw = Stopwatch.StartNew();

        if (args.Length < 2) {
            WriteLine("Usage: freq.exe {filename} {k} [{buffersize}]");
            return;
        }

        string file = args[0];
        int k = int.Parse(args[1]);
        int bufferSize = args.Length >= 3 ? int.Parse(args[2]) : DefaultBufferSize;

        Node root = new Node(null, -1) { Nodes = new Node[26] }, current = root;
        int b;
        uint u;

        using (var fr = new FileStream(file, FileMode.Open))
        using (var br = new BufferedStream(fr, bufferSize)) {
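            // State: between words. Skip bytes until a letter starts a word.
            outword:
                b = br.ReadByte() | 32;  // | 32 lowercases A-Z; EOF (-1) stays -1
                if ((u = (uint)(b - 97)) >= 26) {  // one unsigned compare: not a..z
                    if (b == -1) goto done;
                    else goto outword;
                }
                else current = root.Traverse(u);
            // State: inside a word. Descend the trie one letter at a time.
            inword:
                b = br.ReadByte() | 32;
                if ((u = (uint)(b - 97)) >= 26) {
                    ++current.Count;  // the word ends here, also if the file ends mid-word
                    if (b == -1) goto done;
                    goto outword;
                }
                else {
                    current = current.Traverse(u);
                    goto inword;
                }
            done:;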
        }

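        // Sort the nodes that actually end a word by count and print the top k.
        WriteLine(string.Join("\n", Node.AllNodes
            .Where(node => node.Count > 0)
            .OrderByDescending(node => node.Count)
            .Take(k)
            .Select(node => node.GetWord())));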

        WriteLine("Self-measured milliseconds: {0}", sw.ElapsedMilliseconds);
    }
}

Here is a sample output.

C:\dev\freq>csc -o -nologo freq-trie.cs && freq-trie.exe giganovel 100000
e
ihit
ah
ist
 [... omitted for sanity ...]
omaah
aanhele
okaistai
akaanio
Self-measured milliseconds: 13619

At first, I tried to use a dictionary with string keys, but that was far too slow. I think .NET strings are represented internally with a 2-byte encoding, which is rather wasteful for this application. So I switched to raw bytes and an ugly goto-style state machine. Case conversion is a single bitwise operation. The character range check is done with a single comparison after a subtraction. I did not spend any effort on optimizing the final sort, since I found it uses less than 0.1% of the runtime.
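The two byte tricks described above can be sketched in Ruby as follows. This is only an illustration, not the answer's code: the names `letter_index` and `count_words` are made up here, a Hash stands in for the trie, and the 32-bit mask is an assumption used to imitate the C# `uint` cast, since Ruby integers do not wrap around.

```ruby
# Hypothetical helper: maps a byte to 0..25 for A-Z / a-z, or nil otherwise.
def letter_index(byte)
  # OR with 32 sets the lowercase bit, so 'A' (65) becomes 'a' (97).
  # Masking to 32 bits imitates the C# uint cast: values below 'a' wrap
  # around to huge numbers, so a single "< 26" test covers both bounds.
  u = ((byte | 32) - 97) & 0xFFFFFFFF
  u < 26 ? u : nil
end

# Counts words with the same two-state idea (outside a word / inside a word),
# using a Hash instead of the trie from the C# program.
def count_words(data)
  counts = Hash.new(0)
  word = +""
  data.each_byte do |byte|
    idx = letter_index(byte)
    if idx
      word << (97 + idx).chr          # inside a word: append the lowercased letter
    elsif !word.empty?
      counts[word] += 1               # a non-letter ends the current word
      word = +""
    end
  end
  counts[word] += 1 unless word.empty? # the input may end in the middle of a word
  counts
end
```

For example, `count_words("WoRd word")` should give a count of 2 for `"word"`, matching the rule that upper and lower case are equivalent.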

Edit: The algorithm was essentially correct, but it over-reported the total number of words by counting all prefixes of words. Since the total word count is not a requirement of the problem, I removed that output. I also adjusted the output so that all k words are printed. In the end I settled on using string.Join() and writing the whole list at once. Surprisingly, on my machine this is about a second faster than writing each word separately for k = 100,000.

— recursive
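To make the prefix issue concrete, here is a minimal Ruby sketch of a counting trie with parent links. The class and method names (`TrieNode`, `top_k`) are hypothetical and the structure only loosely mirrors the C# `Node` class. The point it shows: every prefix of a word gets a node, but only the node where a word ends gets its count incremented, so nodes with a zero count have to be left out of any word total or listing.

```ruby
# Hypothetical Ruby counterpart of the C# Node class; names are illustrative.
class TrieNode
  attr_reader :parent, :index, :children
  attr_accessor :count

  def initialize(parent, index)
    @parent = parent
    @index = index     # 0..25 for a..z, -1 for the root
    @children = nil    # allocated on first use
    @count = 0
  end

  # Returns the child for letter index i, creating it if needed.
  def child(i)
    @children ||= Array.new(26)
    @children[i] ||= TrieNode.new(self, i)
  end

  # Rebuilds the word by walking parent links up to the root.
  def word
    index < 0 ? "" : parent.word + (97 + index).chr
  end

  # Visits this node and every descendant.
  def each_node(&block)
    yield self
    children&.each { |c| c&.each_node(&block) }
  end
end

def top_k(text, k)
  root = TrieNode.new(nil, -1)
  text.scan(/[A-Za-z]+/) do |w|
    node = w.downcase.each_byte.reduce(root) { |n, b| n.child(b - 97) }
    node.count += 1    # only the node where the word ends is incremented
  end
  nodes = []
  root.each_node { |n| nodes << n if n.count > 0 }  # skip pure prefixes
  nodes.max_by(k, &:count).map { |n| [n.word, n.count] }
end
```

With the input `"abc"`, the trie holds nodes for `a`, `ab` and `abc`, yet only `abc` has a non-zero count, so `top_k("abc", 5)` lists a single word.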

1

Very impressive! I like your bitwise tolower and single-comparison tricks. However, I do not understand why your program reports more distinct words than expected. Also, according to the original problem description, the program needs to output all k words in decreasing order of frequency, so I did not count your program towards the last test, which requires printing the 100,000 most frequent words.

— Andriy Makukha

@AndriyMakukha: I can see that I was also counting word prefixes that never occurred as words in the final count. I avoided writing all the output because console output is quite slow on Windows. Can I write the output to a file?

— recursive

Just print to standard output, please. For k = 10 it should be fast on any machine. You can also redirect the output to a file from the command line. Like this.

— Andriy Makukha

@AndriyMakukha: I believe I have addressed all the problems. I found a way to produce all the required output without much runtime cost.

— recursive

This output is blazingly fast! Very nice. I modified your program to also print the frequency counts, as the other solutions do.

— Andriy Makukha

1

Ruby 2.7.0-preview1 with `tally`

The latest version of Ruby has a new method called `tally`. From the release notes:

`Enumerable#tally` is added. It counts the occurrence of each element.

["a", "b", "c", "b"].tally
#=> {"a"=>1, "b"=>2, "c"=>1}

This solves almost the entire task for us. We just need to read the file first and find the maximum afterwards.

Here is the whole thing:

k = ARGV.shift.to_i

pp ARGF
  .each_line
  .lazy
  .flat_map { @1.scan(/[A-Za-z]+/).map(&:downcase) }
  .tally
  .max_by(k, &:last)

edit: Added k as a command-line argument.

It can be run with `ruby filename.rb k input.txt` using the 2.7.0-preview1 version of Ruby. This can be downloaded from the various links on the release notes page, or installed with rbenv using `rbenv install 2.7.0-dev`.

Example run on my own beat-up, old computer:

$ time ruby bentley.rb 10 ulysses64 
[["the", 968832],
 ["of", 528960],
 ["and", 466432],
 ["a", 421184],
 ["to", 322624],
 ["in", 320512],
 ["he", 270528],
 ["his", 213120],
 ["i", 191808],
 ["s", 182144]]

real    0m17.884s
user    0m17.720s
sys     0m0.142s
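A note on portability: the listing in this answer uses `@1`, the numbered block parameter as spelled in 2.7.0-preview1. As far as I know, the final 2.7.0 release spells it `_1` instead, so the code as written may not parse on later versions. The sketch below uses an explicit block parameter, which should work on any Ruby that has `Enumerable#tally`. The helper name `top_words` and the in-memory sample input are assumptions for illustration only.

```ruby
require "stringio"

# Hypothetical helper; takes any IO-like object that responds to each_line.
def top_words(io, k)
  io.each_line
    .lazy                                   # avoid building one huge array of words
    .flat_map { |line| line.scan(/[A-Za-z]+/).map(&:downcase) }
    .tally                                  # Hash of word => number of occurrences
    .max_by(k, &:last)                      # k pairs with the highest counts, descending
end

# Usage sketch with an in-memory input instead of ARGF.
sample = StringIO.new("The cat and the dog.\nTHE end and...")
pp top_words(sample, 2)
# prints [["the", 3], ["and", 2]]
```

To read from files given on the command line as the original does, `ARGF` can be passed in place of the `StringIO` object.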

— daniero

1

I installed Ruby from source. It runs roughly as fast as on your machine (15 seconds vs 17).

— Andriy Makukha

Bentley's coding challenge: k most frequent words
